Related
Essentially, I need to write a faster implementation as a replacement for insert() to insert an element in a particular position in a list.
The inputs are given in a list as [(index, value), (index, value), (index, value)]
For example, doing this to insert 10,000 elements into a 1,000,000-element list takes about 2.7 seconds:
def do_insertions_simple(l, insertions):
    """Performs the insertions specified into l.
    :param l: list in which to do the insertions. It is not modified.
    :param insertions: list of pairs (i, x), indicating that x should
        be inserted at position i.
    """
    r = list(l)
    for i, x in insertions:
        r.insert(i, x)
    return r
My assignment asks me to speed up the time taken to complete the insertions by 8x or more
My current implementation:
def do_insertions_fast(l, insertions):
    """Implement here a faster version of do_insertions_simple """
    # insert insertions[x][i] at l[i]
    result = list(l)
    for x, y in insertions:
        result = result[:x] + list(y) + result[x:]
    return result
Sample input:
import string
l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
insertions = [(0, 'a'), (2, 'b'), (2, 'b'), (7, 'c')]
r1 = do_insertions_simple(l, insertions)
r2 = do_insertions_fast(l, insertions)
print("r1:", r1)
print("r2:", r2)
assert_equal(r1, r2)
is_correct = False
for _ in range(20):
    l, insertions = generate_testing_case(list_len=100, num_insertions=20)
    r1 = do_insertions_simple(l, insertions)
    r2 = do_insertions_fast(l, insertions)
    assert_equal(r1, r2)
    is_correct = True
The error I'm getting while running the above code:
r1: ['a', 0, 'b', 'b', 1, 2, 3, 'c', 4, 5, 6, 7, 8, 9]
r2: ['a', 0, 'b', 'b', 1, 2, 3, 'c', 4, 5, 6, 7, 8, 9]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-54e0c44a8801> in <module>()
12 l, insertions = generate_testing_case(list_len=100, num_insertions=20)
13 r1 = do_insertions_simple(l, insertions)
---> 14 r2 = do_insertions_fast(l, insertions)
15 assert_equal(r1, r2)
16 is_correct = True
<ipython-input-7-b421ee7cc58f> in do_insertions_fast(l, insertions)
4 result=list(l)
5 for x,y in insertions:
----> 6 result = result[:x]+list(y)+result[x:]
7 return result
8 #raise NotImplementedError()
TypeError: 'float' object is not iterable
The file is using the nose framework to check my answers, so if there are any functions you don't recognize, they are probably from that framework.
I know that it is performing the insertions correctly, but it keeps raising the error "'float' object is not iterable".
I've also tried a different method which did work (slicing the list, adding the element, appending the rest of the list, and then updating the list), but that was 10 times slower than insert().
I'm not sure how to continue.
Edit: I've been looking at the entire question wrong; for now I'll try to do it myself, but if I'm stuck again I'll ask a different question and link it here.
From your question, emphasis mine:
I need to write a faster implementation as a replacement for insert() to insert an element in a particular position in a list
You won't be able to. If there were a faster way, the existing insert() function would already use it. Anything you write yourself will not even get close to its speed.
What you can do is write a faster way to do multiple insertions.
Let's look at an example with two insertions:
>>> a = list(range(15))
>>> a.insert(5, 'X')
>>> a.insert(10, 'Y')
>>> a
[0, 1, 2, 3, 4, 'X', 5, 6, 7, 8, 'Y', 9, 10, 11, 12, 13, 14]
Since every insert shifts all values to the right of it, this in general is an O(m*(n+m)) time algorithm, where n is the original size of the list and m is the number of insertions.
Another way to do it is to build the result piece by piece, taking the insertion points into account:
>>> a = list(range(15))
>>> b = []
>>> b.extend(a[:5])
>>> b.append('X')
>>> b.extend(a[5:9])
>>> b.append('Y')
>>> b.extend(a[9:])
>>> b
[0, 1, 2, 3, 4, 'X', 5, 6, 7, 8, 'Y', 9, 10, 11, 12, 13, 14]
This is O(n+m) time, as all values are just copied once and there's no shifting. It's just somewhat tricky to determine the correct piece lengths, as earlier insertions affect later ones, especially if the insertion indexes aren't sorted (and in that case it would also take O(m log m) additional time to sort them). That's why I had to use a[5:9] and a[9:] instead of a[5:10] and a[10:].
(Yes, I know, extend/append internally copy some more if the capacity is exhausted, but if you understand things enough to point that out, then you also understand that it doesn't matter :-)
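For concreteness, here is one way the piece-by-piece idea could be coded up. This is my sketch, not the answer's exact code, and it assumes the insertion indices are non-decreasing, i.e. already sorted in the order they are applied:

def do_insertions_pieces(l, insertions):
    # sketch only: assumes insertion indices are non-decreasing
    placed = []    # finalized prefix of the result
    pending = []   # inserted values that may still get new neighbours to their left
    used = 0       # how many elements of l have been copied into placed
    for i, x in insertions:
        if i > len(placed) + len(pending):
            # everything currently to the left of position i is final: flush it
            placed.extend(pending)
            pending.clear()
            take = i - len(placed)
            placed.extend(l[used:used + take])
            used += take
            pending.append(x)
        else:
            # x lands among the still-pending inserted values
            pending.insert(i - len(placed), x)
    placed.extend(pending)
    placed.extend(l[used:])
    return placed

The original elements are copied once via extend, and each insert only touches the short pending buffer rather than the whole list, so this stays close to O(n+m) in practice. On the sample input above it reproduces r1 exactly, including the two insertions at index 2.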
One option is to use a different data structure, which supports faster insertions.
The obvious suggestion would be a binary tree of some sort. You can insert nodes into a balanced binary tree in O(log n) time, so long as you're able to find the right insertion point in O(log n) time. A solution to that is for each node to store and maintain its own subtree's cardinality; then you can find a node by index without iterating through the whole tree. Another possibility is a skip list, which supports insertion in O(log n) average time.
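For illustration, here is a rough sketch of the subtree-cardinality idea using a treap (a randomized balanced BST) ordered by position. This is my own sketch, not part of the answer, and the class and function names are made up:

import random

class Node:
    def __init__(self, value):
        self.value = value
        self.prio = random.random()  # random heap priority keeps the tree balanced in expectation
        self.size = 1                # cardinality of this node's subtree
        self.left = None
        self.right = None

def _size(node):
    return node.size if node else 0

def _update(node):
    node.size = 1 + _size(node.left) + _size(node.right)
    return node

def _split(node, k):
    """Split into (tree of the first k elements, tree of the rest)."""
    if node is None:
        return None, None
    if _size(node.left) < k:
        node.right, rest = _split(node.right, k - _size(node.left) - 1)
        return _update(node), rest
    first, node.left = _split(node.left, k)
    return first, _update(node)

def _merge(a, b):
    """Concatenate two trees (everything in a comes before everything in b)."""
    if a is None: return b
    if b is None: return a
    if a.prio > b.prio:
        a.right = _merge(a.right, b)
        return _update(a)
    b.left = _merge(a, b.left)
    return _update(b)

def insert_at(root, index, value):
    """Positional insert in O(log n) expected time."""
    first, rest = _split(root, index)
    return _merge(_merge(first, Node(value)), rest)

def to_list(root, out=None):
    """In-order traversal back to a plain list."""
    if out is None:
        out = []
    if root is not None:
        to_list(root.left, out)
        out.append(root.value)
        to_list(root.right, out)
    return out

# usage sketch: build the tree from the original list, then apply the insertions
root = None
for v in l:
    root = insert_at(root, _size(root), v)   # append at the end
for i, x in insertions:
    root = insert_at(root, i, x)
result = to_list(root)

As the next paragraph explains, though, the constant factors of manipulating Python objects per node are large, so a structure like this may still struggle to beat list.insert by 8x at these input sizes.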
However, the problem is that you are writing in Python, so you have a major disadvantage trying to write something faster than the built-in list.insert method, because that's implemented in C, and Python code is a lot slower than C code. It's not unusual to write an O(log n) algorithm in Python that only beats the built-in O(n) implementation for very large n, and even n = 1,000,000 may not be large enough to win by a factor of 8 or more. This could mean a lot of wasted effort if you try implementing your own data structure and it turns out not to be fast enough.
I think the expected solution for this assignment will be something like Heap Overflow's answer. That said, there is another way to approach this question which is worth considering because it avoids the complications of working out the correct indices to insert at if you do the insertions out of order. My idea is to take advantage of the efficiency of list.insert but to call it on shorter lists.
If the data is still stored in Python lists, then the list.insert method can still be used to get the efficiency of a C implementation, but if the lists are shorter then the insert method will be faster. Since you only need to win by a constant factor, you can divide the input list into, say, 256 sublists of roughly equal size. Then for each insertion, insert it at the correct index in the correct sublist; and finally join the sublists back together again. The time complexity is O(nm), which is the same as the "naive" solution, but it has a lower constant factor.
To compute the correct insertion index we need to subtract the lengths of the sublists to the left of the one we're inserting in; we can store the cumulative sublist lengths in a prefix sum array, and update this array efficiently using numpy. Here's my implementation:
from itertools import islice, chain, accumulate
import numpy as np

def do_insertions_split(lst, insertions, num_sublists=256):
    n = len(lst)
    sublist_len = n // num_sublists
    lst_iter = iter(lst)
    sublists = [list(islice(lst_iter, sublist_len)) for i in range(num_sublists-1)]
    sublists.append(list(lst_iter))
    lens = [0]
    lens.extend(accumulate(len(s) for s in sublists))
    lens = np.array(lens)
    for idx, val in insertions:
        # could use binary search, but num_sublists is small
        j = np.argmax(lens >= idx)
        sublists[j-1].insert(idx - lens[j-1], val)
        lens[j:] += 1
    return list(chain.from_iterable(sublists))
It is not as fast as @iz_'s implementation (linked from the comments), but it beats the simple algorithm by a factor of almost 20, which is sufficient according to the problem statement. The times below were measured using timeit on a list of length 1,000,000 with 10,000 insertions.
simple -> 2.1252768037122087 seconds
iz -> 0.041302349785668824 seconds
split -> 0.10893724981304054 seconds
Note that my solution still loses to @iz_'s by a factor of about 2.5. However, @iz_'s solution requires the insertion points to be sorted, whereas mine works even when they are unsorted:
from random import randint

lst = list(range(1_000_000))
insertions = [(randint(0, len(lst)), "x") for _ in range(10_000)]
# uncomment if the insertion points should be sorted
# insertions.sort()

r1 = do_insertions_simple(lst, insertions)
r2 = do_insertions_iz(lst, insertions)
r3 = do_insertions_split(lst, insertions)

if r1 != r2: print('iz failed') # prints
if r1 != r3: print('split failed') # doesn't print
Here is my timing code, in case anyone else wants to compare. I tried a few different values for num_sublists; anything between 200 and 1,000 seemed to be about equally good.
from timeit import timeit

algorithms = {
    'simple': do_insertions_simple,
    'iz': do_insertions_iz,
    'split': do_insertions_split,
}

reps = 10
for name, func in algorithms.items():
    t = timeit(lambda: func(lst, insertions), number=reps) / reps
    print(name, '->', t, 'seconds')
list(y) attempts to iterate over y and create a list of its elements. If y is a number (here a float), it is not iterable, and you get the error you mentioned. You instead probably want to create a list literal containing y, like so: [y]
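In other words, the offending line in do_insertions_fast would become:

result = result[:x] + [y] + result[x:]

(Though, as noted in the question, rebuilding the list by slicing on every insertion is itself much slower than insert(), so this fixes the TypeError but not the speed.)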
It is easy to convert an entire iterator sequence into a list using list(iterator), but what is the best/fastest way to directly create a sublist from an iterator without first creating the entire list, i.e. how to best create list(iterator)[m:n] without first creating the entire list?
It seems obvious that it should not* (at least not always) be possible to do so directly for m > 0, but it should be for n less than the length of the sequence. [p for i,p in zip(range(n), iterator)] comes to mind, but is that the best way?
The context is simple: Creating the entire list would cause a RAM overflow, so it needs to be broken down. So how do you do this efficiently and/or python-ic-ly?
*The list comprehension I mentioned could obviously be used for m > 0 by calling next(iterator) m times prior to execution, but I don't enjoy the lack of python-ness here.
itertools.islice:
from itertools import islice
itr = (i for i in range(10))
m, n = 3, 8
result = list(islice(itr, m, n))
print(result)
# [3, 4, 5, 6, 7]
In addition, you can pass a step argument if you want:
itr = (i for i in range(10))
m, n, step = 3, 8, 2
result = list(islice(itr, m, n, step))
print(result)
# [3, 5, 7]
I want to calculate the sum of a collection, for sections of different sizes:
d = (1, 2, 3, 4, 5, 6, 7, 8, 9)
sz = (2, 3, 4)
# here I expect 1+2=3, 3+4+5=12, 6+7+8+9=30
itd = iter(d)
result = tuple( sum(tuple(next(itd) for i in range(s))) for s in sz )
print("result = {}".format(result))
I wonder whether the solution I came up with is the most 'pythonic' (elegant, readable, concise) way to achieve what I want...
In particular, I wonder whether there is a way to get rid of the separate iterator 'itd', and whether it would be easier to work with slices?
I would use itertools.islice since you can directly use the values in sz as the number of items to take at each point:
>>> from itertools import islice
>>> it=iter(d)
>>> [sum(islice(it,s)) for s in sz]
[3, 12, 30]
Then you can convert that to a tuple if needed.
The iter is certainly needed in order to pick up each slice at the point where the last slice left off. Otherwise each slice would just be d[0:s].
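For example, without the iter, each slice restarts from the beginning of d (a quick illustration, not from the original answer):

>>> [sum(islice(d, s)) for s in sz]
[3, 6, 10]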
There's no reason to get rid of your iterator – iterating over d is what you are doing, after all.
You do seem to have an overabundance of tuples in that code, though. The line that's doing all the work could be made more legible by getting rid of them:
it = iter(d)
result = [sum(next(it) for _ in range(s)) for s in sz]
# [3, 12, 30]
… which has the added advantage that now you're producing a list rather than a tuple. d and sz also make more sense as lists, by the way: they're variable-length sequences of homogeneous data, not fixed-length sequences of heterogeneous data.
Note also that it is the conventional name for an arbitrary iterator, and _ is the conventional name for any variable that must exist but is never actually used.
Going a little further, next(it) for _ in range(s) is doing the same work that islice() could do more legibly:
from itertools import islice
it = iter(d)
result = [sum(islice(it, s)) for s in sz]
# [3, 12, 30]
… at which point, I'd say the code's about as elegant, readable and concise as it's likely to get.
I'm trying to identify if a large list has consecutive elements that are the same.
So let's say:
lst = [1, 2, 3, 4, 5, 5, 6]
And in this case, I would return True, since two consecutive elements, lst[4] and lst[5], have the same value.
I know this could probably be done with some sort of combination of loops, but I was wondering if there were a more efficient way to do this?
You can use itertools.groupby() and a generator expression within any()*:
>>> from itertools import groupby
>>> any(sum(1 for _ in g) > 1 for _, g in groupby(lst))
True
Or as a more Pythonic way you can use zip(), in order to check if at least there are two equal consecutive items in your list:
>>> any(i==j for i,j in zip(lst, lst[1:])) # In python-2.x,in order to avoid creating a 'list' of all pairs instead of an iterator use itertools.izip()
True
Note: The first approach is good when you want to check if there are more than 2 consecutive equal items; otherwise, in this case the second one takes the cake!
* Using sum(1 for _ in g) instead of len(list(g)) is very optimized in terms of memory use (not reading the whole list in memory at once) but the latter is slightly faster.
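For example, to require at least three equal items in a row, the first approach only needs its threshold changed (a small adaptation, not part of the original answer):

>>> any(sum(1 for _ in g) >= 3 for _, g in groupby(lst))
False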
You can use a simple any condition:
lst = [1, 2, 3, 4, 5, 5, 6]
any(lst[i]==lst[i+1] for i in range(len(lst)-1))
#outputs:
True
any() returns True if any of the iterable's elements is truthy.
If you're looking for an efficient way of doing this and the lists are numerical, you would probably want to use numpy and apply the diff (difference) function:
>>> numpy.diff([1,2,3,4,5,5,6])
array([1, 1, 1, 1, 0, 1])
Then to get a single result regarding whether there are any consecutive elements:
>>> numpy.any(~numpy.diff([1,2,3,4,5,5,6]).astype(bool))
This first performs the diff, converts it to boolean and inverts it, and then checks whether any of the resulting elements are True (i.e. whether any difference was zero).
Similarly,
>>> 0 in numpy.diff([1, 2, 3, 4, 5, 5, 6])
also works well and is similar in speed to the np.any approach (credit for this last version to heracho).
Here a more general numpy one-liner:
import numpy as np

number = 7
n_consecutive = 3
arr = np.array([3, 3, 6, 5, 8, 7, 7, 7, 4, 5])
#                              ^  ^  ^
np.any(np.convolve(arr == number, v=np.ones(n_consecutive), mode='valid')
       == n_consecutive)
This method always searches the whole array, while the approach from @Kasramvd ends when the condition is first met. So which method is faster depends on how sparse those cases of consecutive numbers are.
If you are interested in the positions of the consecutive numbers, and have to look at all elements of the array, this approach should be faster (for larger arrays and/or longer sequences).
idx = np.nonzero(np.convolve(arr == number, v=np.ones(n_consecutive), mode='valid')
                 == n_consecutive)
# idx = i: all(arr[i:i+n_consecutive] == number)
If you are not interested in a specific value but in consecutive equal numbers in general, a slight variation of @jmetz's answer:
np.any(np.convolve(np.abs(np.diff(arr)), v=np.ones(n_consecutive-1), mode='valid') == 0)
#                  ^^^^^^
# EDIT: see djvg's comment
Starting in Python 3.10, the new pairwise function provides a way to slide through pairs of consecutive elements, so that we can test the equality between consecutive elements:
from itertools import pairwise
any(x == y for (x, y) in pairwise([1, 2, 3, 4, 5, 5, 6]))
# True
The intermediate result of pairwise:
pairwise([1, 2, 3, 4, 5, 5, 6])
# [(1, 2), (2, 3), (3, 4), (4, 5), (5, 5), (5, 6)]
A simple for loop should do it:
def check(lst):
    last = lst[0]
    for num in lst[1:]:
        if num == last:
            return True
        last = num
    return False

lst = [1, 2, 3, 4, 5, 5, 6]
print(check(lst))  # Prints True
Here, in each loop, I check if the current element is equal to the previous element.
The convolution approach suggested in scleronomic's answer is very promising, especially if you're looking for more than two consecutive elements.
However, the implementation presented in that answer might not be the most efficient, because it consists of two steps: diff() followed by convolve().
Alternative implementation
If we consider that the diff() can also be calculated using convolution, we can combine the two steps into a single convolution.
The following alternative implementation only requires a single convolution of the full signal, which is advantageous if the signal has many elements.
Note that in a single convolution we cannot take the absolute value of the diff (which is what prevents the false positives mentioned in this comment), so we add some random noise to the unit kernel instead.
# example signal
signal = numpy.array([1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0])
# minimum number of consecutive elements
n_consecutive = 3
# convolution kernel for weighted moving sum (with small random component)
rng = numpy.random.default_rng()
random_kernel = 1 + 0.01 * rng.random(n_consecutive - 1)
# convolution kernel for first-order difference (similar to numpy.diff)
diff_kernel = [1, -1]
# combine the kernels so we only need to do one convolution with the signal
combined_kernel = numpy.convolve(diff_kernel, random_kernel, mode='full')
# convolve the signal to get the moving weighted sum of differences
moving_sum_of_diffs = numpy.convolve(signal, combined_kernel, mode='valid')
# check if moving sum is zero anywhere
result = numpy.any(moving_sum_of_diffs == 0)
See the DSP guide for a detailed discussion of convolution.
Timing
The difference between the two implementations boils down to this:
def original(signal, unit_kernel):
    return numpy.convolve(numpy.abs(numpy.diff(signal)), unit_kernel, mode='valid')

def alternative(signal, combined_kernel):
    return numpy.convolve(signal, combined_kernel, mode='valid')
where unit_kernel = numpy.ones(n_consecutive - 1) and combined_kernel is defined above.
Comparison of these two functions, using timeit, shows that alternative() can be several times faster, for small kernel sizes (i.e. small value of n_consecutive). However, for large kernel sizes the advantage becomes negligible, because the convolution becomes dominant (compared to the diff).
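For reference, a hypothetical harness for such a comparison, reusing the original() and alternative() functions above, might look like this (the test signal, kernel size and repetition count here are my own choices, not the original benchmark):

from timeit import timeit
import numpy

n_consecutive = 3
rng = numpy.random.default_rng()
test_signal = rng.integers(0, 2, size=100_000)
unit_kernel = numpy.ones(n_consecutive - 1)
combined_kernel = numpy.convolve([1, -1], 1 + 0.01 * rng.random(n_consecutive - 1), mode='full')

print('original:   ', timeit(lambda: original(test_signal, unit_kernel), number=100))
print('alternative:', timeit(lambda: alternative(test_signal, combined_kernel), number=100))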
Notes:
For large kernel sizes I would prefer the original two-step approach, as I think it is easier to understand.
Due to numerical issues it may be necessary to replace numpy.any(moving_sum_of_diffs == 0) by numpy.any(numpy.abs(moving_sum_of_diffs) < very_small_number), see e.g. here.
My solution, if you want to find out whether 3 consecutive values are equal to 7. For example, with a tuple intList = (7, 7, 7, 8, 9, 1):
for i in range(len(intList) - 2):
    if intList[i] == 7 and intList[i + 1] == 7 and intList[i + 2] == 7:
        return True
return False
I have a dataset of ca. 9K lists of variable length (1 to 100K elements). I need to calculate the length of the intersection of all possible 2-list combinations in this dataset. Note that elements in each list are unique so they can be stored as sets in python.
What is the most efficient way to perform this in python?
Edit: I forgot to specify that I need to have the ability to match the intersection values to the corresponding pair of lists. Thanks everybody for the prompt response and apologies for the confusion!
If your sets are stored in s, for example:
s = [set([1, 2]), set([1, 3]), set([1, 2, 3]), set([2, 4])]
Then you can use itertools.combinations to take them two by two, and calculate the intersection (note that, as Alex pointed out, combinations is only available since version 2.6). Here with a list comprehension (just for the sake of the example):
from itertools import combinations
[ i[0] & i[1] for i in combinations(s,2) ]
Or, in a loop, which is probably what you need:
for i in combinations(s, 2):
    inter = i[0] & i[1]
    # processes the intersection set result "inter"
So, to have the length of each one of them, that "processing" would be:
l = len(inter)
This would be quite efficient, since it's using iterators to compute every combination, and does not prepare all of them in advance.
Edit: Note that with this method, each set in the list "s" can actually be something else that returns a set, like a generator. The list itself could simply be a generator if you are short on memory. It could be much slower though, depending on how you generate these elements, but you wouldn't need to have the whole list of sets in memory at the same time (not that it should be a problem in your case).
For example, if each set is made from a function gen:
def gen(parameter):
    while more_sets():
        # ... some code to generate the next set 'x'
        yield x

with open("results", "wt") as f_results:
    for i in combinations(gen("data"), 2):
        inter = i[0] & i[1]
        f_results.write("%d\n" % len(inter))
Edit 2: How to collect indices (following redrat's comment).
Besides the quick solution I answered in comment, a more efficient way to collect the set indices would be to have a list of (index, set) instead of a list of set.
Example with new format:
s = [(0, set([1, 2])), (1, set([1, 3])), (2, set([1, 2, 3]))]
If you are building this list to calculate the combinations anyway, it should be simple to adapt to your new requirements. The main loop becomes:
with open("results", "wt") as f_results:
for i in combinations(s, 2):
inter = i[0][1] & i[1][1]
f_results.write("length of %d & %d: %d\n" % (i[0][0],i[1][0],len(inter))
In the loop, i[0] and i[1] would be a tuple (index, set), so i[0][1] is the first set, i[0][0] its index.
As you need to produce a (N by N/2) matrix of results, i.e., O(N squared) outputs, no approach can be less than O(N squared) -- in any language, of course. (N is "about 9K" in your question). So, I see nothing intrinsically faster than (a) making the N sets you need, and (b) iterating over them to produce the output -- i.e., the simplest approach. IOW:
def lotsofintersections(manylists):
    manysets = [set(x) for x in manylists]
    moresets = list(manysets)
    for s in reversed(manysets):
        moresets.pop()
        for z in moresets:
            yield s & z
This code's already trying to add some minor optimization (e.g. by avoiding slicing or popping off the front of lists, which might add other O(N squared) factors).
If you have many cores and/or nodes available and are looking for parallel algorithms, it's a different case of course -- if that's your case, can you mention the kind of cluster you have, its size, how nodes and cores can best communicate, and so forth?
Edit: as the OP has casually mentioned in a comment (!) that they actually need the numbers of the sets being intersected (really, why omit such crucial parts of the specs?! at least edit the question to clarify them...), this would only require changing this to:
L = len(manysets)
for i, s in enumerate(reversed(manysets)):
    moresets.pop()
    for j, z in enumerate(moresets):
        yield L - i, j + 1, s & z
(if you need to "count from 1" for the progressive identifiers -- otherwise obvious change).
But if that's part of the specs you might as well use simpler code -- forget moresets, and:
L = len(manysets)
for i in xrange(L):
    s = manysets[i]
    for j in range(i+1, L):
        yield i, j, s & manysets[j]
this time assuming you want to "count from 0" instead, just for variety;-)
Try this:
from functools import reduce  # needed on Python 3

_lists = [[1, 2, 3, 7], [1, 3], [1, 2, 3], [1, 3, 4, 7]]
_sets = map(set, _lists)
_intersection = reduce(set.intersection, _sets)
And to obtain the indexes:
_idxs = [ map(_i.index, _intersection ) for _i in _lists ]
Cheers,
José María García
PS: Sorry I misunderstood the question