Essentially, I need to write a faster implementation as a replacement for insert() to insert an element in a particular position in a list.
The inputs are given in a list as [(index, value), (index, value), (index, value)]
For example: Doing this to insert 10,000 elements in a 1,000,000 element list takes about 2.7 seconds
def do_insertions_simple(l, insertions):
    """Performs the insertions specified into l.
    :param l: list in which to do the insertions. It is not modified.
    :param insertions: list of pairs (i, x), indicating that x should
        be inserted at position i.
    """
    r = list(l)
    for i, x in insertions:
        r.insert(i, x)
    return r
My assignment asks me to speed up the time taken to complete the insertions by 8x or more
My current implementation:
def do_insertions_fast(l, insertions):
    """Implement here a faster version of do_insertions_simple """
    # insert insertions[x][i] at l[i]
    result = list(l)
    for x, y in insertions:
        result = result[:x] + list(y) + result[x:]
    return result
Sample input:
import string
l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
insertions = [(0, 'a'), (2, 'b'), (2, 'b'), (7, 'c')]
r1 = do_insertions_simple(l, insertions)
r2 = do_insertions_fast(l, insertions)
print("r1:", r1)
print("r2:", r2)
assert_equal(r1, r2)
is_correct = False
for _ in range(20):
    l, insertions = generate_testing_case(list_len=100, num_insertions=20)
    r1 = do_insertions_simple(l, insertions)
    r2 = do_insertions_fast(l, insertions)
    assert_equal(r1, r2)
is_correct = True
The error I'm getting while running the above code:
r1: ['a', 0, 'b', 'b', 1, 2, 3, 'c', 4, 5, 6, 7, 8, 9]
r2: ['a', 0, 'b', 'b', 1, 2, 3, 'c', 4, 5, 6, 7, 8, 9]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-54e0c44a8801> in <module>()
12 l, insertions = generate_testing_case(list_len=100, num_insertions=20)
13 r1 = do_insertions_simple(l, insertions)
---> 14 r2 = do_insertions_fast(l, insertions)
15 assert_equal(r1, r2)
16 is_correct = True
<ipython-input-7-b421ee7cc58f> in do_insertions_fast(l, insertions)
4 result=list(l)
5 for x,y in insertions:
----> 6 result = result[:x]+list(y)+result[x:]
7 return result
8 #raise NotImplementedError()
TypeError: 'float' object is not iterable
The file uses the nose framework to check my answers, so if there are any functions you don't recognize, they're probably from that framework.
I know that it is inserting the elements correctly; however, it keeps raising the error "'float' object is not iterable".
I've also tried a different method which did work (slicing the list, adding the element, appending the rest of the list, and then updating the list), but that was 10 times slower than insert().
I'm not sure how to continue.
edit: I've been looking at the entire question wrong, for now I'll try to do it myself but if I'm stuck again I'll ask a different question and link that here
From your question, emphasis mine:
I need to write a faster implementation as a replacement for insert() to insert an element in a particular position in a list
You won't be able to. If there were a faster way, then the existing insert() function would already use it. Any pure-Python replacement for a single insert will not even get close to its speed.
What you can do is write a faster way to do multiple insertions.
Let's look at an example with two insertions:
>>> a = list(range(15))
>>> a.insert(5, 'X')
>>> a.insert(10, 'Y')
>>> a
[0, 1, 2, 3, 4, 'X', 5, 6, 7, 8, 'Y', 9, 10, 11, 12, 13, 14]
Since every insert shifts all values to the right of it, this in general is an O(m*(n+m)) time algorithm, where n is the original size of the list and m is the number of insertions.
Another way to do it is to build the result piece by piece, taking the insertion points into account:
>>> a = list(range(15))
>>> b = []
>>> b.extend(a[:5])
>>> b.append('X')
>>> b.extend(a[5:9])
>>> b.append('Y')
>>> b.extend(a[9:])
>>> b
[0, 1, 2, 3, 4, 'X', 5, 6, 7, 8, 'Y', 9, 10, 11, 12, 13, 14]
This is O(n+m) time, as all values are copied just once and there's no shifting. It's just somewhat tricky to determine the correct piece lengths, as earlier insertions affect later ones, especially if the insertion indexes aren't sorted (in which case it would also take O(m log m) additional time to sort them). That's why I had to use a[5:9] and a[9:] instead of a[5:10] and a[10:].
(Yes, I know, extend/append internally copy some more if the capacity is exhausted, but if you understand things enough to point that out, then you also understand that it doesn't matter :-)
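To make the piece-by-piece idea concrete, here is a minimal sketch (the function name do_insertions_pieces is mine, not from the question); it assumes the insertion positions are strictly increasing, so each insertion lands after all earlier ones and none of the extra bookkeeping for ties or unsorted indexes is needed:

```python
def do_insertions_pieces(l, insertions):
    """Build the result by copying untouched runs of l and appending
    each inserted value in between. Assumes strictly increasing
    insertion positions (positions refer to the growing result)."""
    result = []
    prev = 0  # index into l of the first element not yet copied
    for k, (i, x) in enumerate(insertions):
        # k earlier insertions shifted the result right by k,
        # so position i in the result corresponds to i - k in l
        src = i - k
        result.extend(l[prev:src])
        result.append(x)
        prev = src
    result.extend(l[prev:])
    return result
```

With the example above, do_insertions_pieces(list(range(15)), [(5, 'X'), (10, 'Y')]) reproduces the same list as the two insert() calls, in O(n+m) time.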
One option is to use a different data structure, which supports faster insertions.
The obvious suggestion would be a binary tree of some sort. You can insert nodes into a balanced binary tree in O(log n) time, so long as you're able to find the right insertion point in O(log n) time. A solution to that is for each node to store and maintain its own subtree's cardinality; then you can find a node by index without iterating through the whole tree. Another possibility is a skip list, which supports insertion in O(log n) average time.
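To illustrate the index-lookup part of that idea, here is a sketch only (the Node class is hypothetical, and insertion and rebalancing are omitted): if each node stores its subtree's size, the k-th element can be found by descending one level per step.

```python
class Node:
    """Hypothetical tree node that tracks its subtree's cardinality."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right
        # size of this node's whole subtree
        self.size = 1 + (left.size if left else 0) + (right.size if right else 0)

def kth(node, k):
    """Return the k-th value (0-based, in-order) without a full traversal."""
    left_size = node.left.size if node.left else 0
    if k < left_size:
        return kth(node.left, k)
    if k == left_size:
        return node.value
    return kth(node.right, k - left_size - 1)
```

In a balanced tree each call descends one level, so the lookup is O(log n).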
However, the problem is that you are writing in Python, so you have a major disadvantage trying to write something faster than the built-in list.insert method, because that's implemented in C, and Python code is a lot slower than C code. It's not unusual to write an O(log n) algorithm in Python that only beats the built-in O(n) implementation for very large n, and even n = 1,000,000 may not be large enough to win by a factor of 8 or more. This could mean a lot of wasted effort if you try implementing your own data structure and it turns out not to be fast enough.
I think the expected solution for this assignment will be something like Heap Overflow's answer. That said, there is another way to approach this question which is worth considering because it avoids the complications of working out the correct indices to insert at if you do the insertions out of order. My idea is to take advantage of the efficiency of list.insert but to call it on shorter lists.
If the data is still stored in Python lists, then the list.insert method can still be used to get the efficiency of a C implementation, but if the lists are shorter then the insert method will be faster. Since you only need to win by a constant factor, you can divide the input list into, say, 256 sublists of roughly equal size. Then for each insertion, insert it at the correct index in the correct sublist; and finally join the sublists back together again. The time complexity is O(nm), which is the same as the "naive" solution, but it has a lower constant factor.
To compute the correct insertion index we need to subtract the lengths of the sublists to the left of the one we're inserting in; we can store the cumulative sublist lengths in a prefix sum array, and update this array efficiently using numpy. Here's my implementation:
from itertools import islice, chain, accumulate
import numpy as np

def do_insertions_split(lst, insertions, num_sublists=256):
    n = len(lst)
    sublist_len = n // num_sublists
    lst_iter = iter(lst)
    sublists = [list(islice(lst_iter, sublist_len)) for i in range(num_sublists - 1)]
    sublists.append(list(lst_iter))
    lens = [0]
    lens.extend(accumulate(len(s) for s in sublists))
    lens = np.array(lens)
    for idx, val in insertions:
        # could use binary search, but num_sublists is small
        j = np.argmax(lens >= idx)
        sublists[j-1].insert(idx - lens[j-1], val)
        lens[j:] += 1
    return list(chain.from_iterable(sublists))
It is not as fast as @iz_'s implementation (linked from the comments), but it beats the simple algorithm by a factor of almost 20, which is sufficient according to the problem statement. The times below were measured using timeit on a list of length 1,000,000 with 10,000 insertions.
simple -> 2.1252768037122087 seconds
iz -> 0.041302349785668824 seconds
split -> 0.10893724981304054 seconds
Note that my solution still loses to @iz_'s by a factor of about 2.5. However, @iz_'s solution requires the insertion points to be sorted, whereas mine works even when they are unsorted:
from random import randint

lst = list(range(1_000_000))
insertions = [(randint(0, len(lst)), "x") for _ in range(10_000)]
# uncomment if the insertion points should be sorted
# insertions.sort()
r1 = do_insertions_simple(lst, insertions)
r2 = do_insertions_iz(lst, insertions)
r3 = do_insertions_split(lst, insertions)
if r1 != r2: print('iz failed')     # prints
if r1 != r3: print('split failed')  # doesn't print
Here is my timing code, in case anyone else wants to compare. I tried a few different values for num_sublists; anything between 200 and 1,000 seemed to be about equally good.
from timeit import timeit

algorithms = {
    'simple': do_insertions_simple,
    'iz': do_insertions_iz,
    'split': do_insertions_split,
}
reps = 10
for name, func in algorithms.items():
    t = timeit(lambda: func(lst, insertions), number=reps) / reps
    print(name, '->', t, 'seconds')
list(y) attempts to iterate over y and build a list of its elements. If y is an integer (or a float, as in your failing test case), it is not iterable, and you get the error you mentioned. You instead probably want a list literal containing just y: [y]
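Applied to the function from the question, the fix is a one-token change (a sketch; behavior is otherwise unchanged, and it is still no faster than the simple version):

```python
def do_insertions_fixed(l, insertions):
    result = list(l)
    for x, y in insertions:
        # [y] is a one-element list; list(y) would try to iterate y
        result = result[:x] + [y] + result[x:]
    return result
```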
It is easy to convert an entire iterator sequence into a list using list(iterator), but what is the best/fastest way to directly create a sublist from an iterator without first creating the entire list, i.e. how to best create list(iterator)[m:n] without first creating the entire list?
It seems obvious that it should not* (at least not always) be possible to do so directly for m > 0, but it should be for n less than the length of the sequence. [p for i,p in zip(range(n), iterator)] comes to mind, but is that the best way?
The context is simple: Creating the entire list would cause a RAM overflow, so it needs to be broken down. So how do you do this efficiently and/or python-ic-ly?
*The list comprehension I mentioned could obviously be used for m > 0 by calling next(iterator) m times prior to execution, but I don't enjoy the lack of python-ness here.
itertools.islice:
from itertools import islice
itr = (i for i in range(10))
m, n = 3, 8
result = list(islice(itr, m, n))
print(result)
# [3, 4, 5, 6, 7]
In addition, you can add an argument as the step if you wanted:
itr = (i for i in range(10))
m, n, step = 3, 8, 2
result = list(islice(itr, m, n, step))
print(result)
# [3, 5, 7]
I have a flat list of numbers that are logically in groups of 3, where each triple is (number, __ignored, flag[0 or 1]), eg:
[7,56,1, 8,0,0, 2,0,0, 6,1,1, 7,2,0, 2,99,1]
I would like to (pythonically) process this list to create a new list of numbers based on the value of 'flag': if 'flag' is 1, then I want 'number', else 0. So the above list would become:
[7, 0, 0, 6, 0, 2]
My initial attempt at doing this is:
list = [7,56,1, 8,0,0, 2,0,0, 6,1,1, 7,2,0, 2,99,1]
numbers = list[::3]
flags = list[2::3]
result = []
for i, number in enumerate(numbers):
    result.append(number if flags[i] == 1 else 0)
This works, but it seems to me that there should be a better way to extract tuples cleanly from a list. Something like:
list = [7,56,1, 8,0,0, 2,0,0, 6,1,1, 7,2,0, 2,99,1]
for (number, __, flag) in list:
...etc
But I don't seem to be able to do this.
I could just loop through the entire list:
result = []
for i in range(0, len(list), 3):
    result.append(list[i] if list[i+2] == 1 else 0)
Which seems smaller and more efficient.
I am unclear about the best option here. Any advice would be appreciated.
Note: I have accepted the answer by wim:
[L[i]*L[i+2] for i in range(0, len(L), 3)]
But I want to reiterate that both wim's and ShadowRanger's responses work. I accepted wim's answer based on simplicity and clarity (and, to a lesser extent, compatibility with Python 2, though ShadowRanger pointed out that zip was in Py2 as well, so this basis is invalid).
The answer by ShadowRanger:
[number if flag else 0 for number, _, flag in zip(*[iter(mylist)]*3)]
also does exactly what I thought I wanted (providing tuples), but is a little obscure and requires zip. As wim noted, ShadowRanger's answer would be very well suited to a stream of data rather than a fixed list.
I would also note that ShadowRanger's answer adds obscure use of zip() (which will become less obscure with time), but adds clarity by the use of named values for the tuple, so it's a bit of a win/lose.
For those struggling to understand zip(*[iter(mylist)]*3), it passes three references to the one iterator, which are then used to construct tuples. Because it's the same iterator, each use advances it, making the tuples exactly as I requested.
For both clarity and generality, I am also somewhat inclined to a modified version of the solution from @ShadowRanger:
i = iter(mylist)
[number if flag else 0 for number, _, flag in zip(i, i, i)]
(which, to me, seems much less obscure).
I think the most direct way would be a simple list comprehension:
>>> [L[i]*L[i+2] for i in range(0, len(L), 3)]
[7, 0, 0, 6, 0, 2]
Or consider numpy, it's powerful for tasks like this:
>>> import numpy as np
>>> a = np.array(L).reshape(-1, 3).T
>>> a
array([[ 7, 8, 2, 6, 7, 2],
[56, 0, 0, 1, 2, 99],
[ 1, 0, 0, 1, 0, 1]])
>>> a[0]*a[2]
array([7, 0, 0, 6, 0, 2])
Since you have logical triples, you can use a little hack with iter, sequence multiplication and zip to accomplish your goal:
result = []
for number, _, flag in zip(*[iter(mylist)]*3):
    result.append(number if flag else 0)  # flag is only 1 or 0, so no need to compare it
That unpacks the same iterator over mylist as three arguments to zip; since it's the same iterator, zip pulls element 0, 1 and 2 for the first output, then 3, 4 and 5, etc. Your loop then unpacks the three elements to logical names (using _ for the value you don't care about).
A list comprehension could even one-line it to:
result = [number if flag else 0 for number, _, flag in zip(*[iter(mylist)]*3)]
though that's getting a little dense, meaning-wise.
Advantages to this approach are:
It works with any iterable, not just lists
It's performant; the input is traversed exactly once, where any slicing solution would traverse it multiple times
It uses meaningful names, not anonymous magic numbers for index offsets
Downside:
zip(*[iter(mylist)]*3) is a little magical
It will silently omit data if your input turns out to not have a length that's a multiple of three (the partial group at the end is dropped)
Note: In anything resembling production code, don't inline the zip/iter/unpack trick. Use the grouper recipe from the itertools module (or a zip-based variant) and call that:
import itertools

# Defined somewhere else for common use
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

for number, _, flag in grouper(mylist, 3):
    ...
The recipe is tested and predictable, and by giving it a name you make the code using it much more obvious.
I guess this may work:
>>> list_ = [7,56,1, 8,0,0, 2,0,0, 6,1,1, 7,2,0, 2,99,1]
>>> [a if c == 1 else 0 for a,c in zip(list_[::3],list_[2::3])]
...
[7, 0, 0, 6, 0, 2]
I have a sorted list of integers in a list called "black" and I'm looking for an elegant way to get start "s" and end "e" of the longest contiguous subsequence (the original problem had black pixels in a wxh-bitmap and I look for the longest line in a given column x). My solution works but looks ugly:
# blacks is a list of integers generated from a bitmap this way:
# blacks = [y for y in range(h) if bits[y*w+x] == 1]
longest = (0, 0)
s = blacks[0]
e = s - 1
for i in blacks:
    if e+1 == i:  # Contiguous?
        e = i
    else:
        if e-s > longest[1]-longest[0]:
            longest = (s, e)
        s = e = i
if e-s > longest[1]-longest[0]:
    longest = (s, e)
print longest
I feel that this could be done in a smart one or two-liner
You could do the following, using itertools.groupby and itertools.chain:
from itertools import groupby, chain
l = [1, 2, 5, 6, 7, 8, 10, 11, 12]
f = lambda x: x[1] - x[0] == 1 # key function to identify proper neighbours
The following is still almost readable ;-) and gets you a decent intermediate step; proceeding from it in a more sensible manner would probably be a valid option:
max((list(g) for k, g in groupby(zip(l, l[1:]), key=f) if k), key=len)
# [(5, 6), (6, 7), (7, 8)]
In order to extract the actual desired sequence [5, 6, 7, 8] in one line, you have to use some more kung-fu:
sorted(set(chain(*max((list(g) for k, g in groupby(zip(l, l[1:]), key=f) if k), key=len))))
# [5, 6, 7, 8]
I shall leave it to you to work out the internals of this monstrosity :-) but keep in mind: a one-liner is often satisfying in the short run, but long-term, better opt for readability and code that you and your co-workers will understand. And readability is a big part of the Pythonicity you allude to.
Also note that this is O(N log N) because of the sorting. You can achieve the same by applying one of the O(N) order-preserving duplicate-removal techniques (involving e.g. an OrderedDict) to the output of chain and keep it O(N), but that one line would get even longer.
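As a sketch of that O(N) variant, here dict.fromkeys stands in for the OrderedDict (it preserves insertion order on Python 3.7+); the chained pairs already come out in ascending order, so no sorting is needed:

```python
from itertools import groupby, chain

l = [1, 2, 5, 6, 7, 8, 10, 11, 12]
f = lambda x: x[1] - x[0] == 1  # proper-neighbour test, as above

longest = max((list(g) for k, g in groupby(zip(l, l[1:]), key=f) if k), key=len)
# chain flattens [(5, 6), (6, 7), (7, 8)] into 5, 6, 6, 7, 7, 8;
# dict.fromkeys drops the duplicates while keeping the order
result = list(dict.fromkeys(chain(*longest)))
print(result)  # [5, 6, 7, 8]
```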
Update:
One of the O(N) ways to do it is DanD.'s suggestion which can be utilised in a single line using the comprehension trick to avoid assigning an intermediate result to a variable:
list(range(*[(x[0][0], x[-1][1]+1) for x in [max((list(g) for k, g in groupby(zip(l, l[1:]), key=f) if k), key=len)]][0]))
# [5, 6, 7, 8]
Prettier, however, it is not :D
[Edit]
From the feedback/answers I have received, I gather there is some confusion regarding the original question. Consequently, I have reduced the problem to its most rudimentary form
Here are the relevant facts of the problem:
I have a sorted sequence: S
I have an item (denoted by i) that is GUARANTEED to be contained in S
I want a find() algorithm that returns an iterator (iter) that points to i
After obtaining the iterator, I want to be able to iterate FORWARD (BACKWARD?) over the elements in S, starting FROM (and including) i
For my fellow C++ programmers who can also program in Python, what I am asking for, is the equivalent of:
const_iterator std::find (const key_type& x ) const;
The iterator returned can then be used to iterate the sequence. I am just trying to find (pun unintended), if there is a similar inbuilt algorithm in Python, to save me having to reinvent the wheel.
Given your relevant facts:
>>> import bisect
>>> def find_fwd_iter(S, i):
... j = bisect.bisect_left(S, i)
... for k in xrange(j, len(S)):
... yield S[k]
...
>>> def find_bkwd_iter(S, i):
... j = bisect.bisect_left(S, i)
... for k in xrange(j, -1, -1):
... yield S[k]
...
>>> L = [100, 150, 200, 300, 400]
>>> list(find_fwd_iter(L, 200))
[200, 300, 400]
>>> list(find_bkwd_iter(L, 200))
[200, 150, 100]
>>>
Yes, you can do it like this:
import itertools
from datetime import datetime
data = {
    "2008-11-10 17:53:59": "data",
    "2005-11-10 17:53:59": "data",
}
list_ = data.keys()
new_list = [datetime.strptime(x, "%Y-%m-%d %H:%M:%S") for x in list_]
begin_date = datetime.strptime("2007-11-10 17:53:59", "%Y-%m-%d %H:%M:%S")
for i in itertools.ifilter(lambda x: x > begin_date, new_list):
    print i
If you know for a fact that the items in your sequence are sorted, you can just use a generator expression:
(item for item in seq if item >= 5)
This returns a generator; it doesn't actually traverse the list until you iterate over it, i.e.:
for item in (item for item in seq if item > 5):
    print item
print item
will only traverse seq once.
Using a generator expression like this is pretty much identical to using itertools.ifilter, which produces a generator that iterates over the list returning only values that meet the filter criterion:
>>> import itertools
>>> seq = [1, 2, 3, 4, 5, 6, 7]
>>> list(itertools.ifilter(lambda x: x>=3, seq))
[3, 4, 5, 6, 7]
I'm not sure why (except for backwards compatibility) we need itertools.ifilter anymore now that we have generator expressions, but other methods in itertools are invaluable.
If, for instance, you don't know that your sequence is sorted, and you still want to return everything in the sequence from a known item and beyond, you can't use a generator expression. Instead, use itertools.dropwhile. This produces a generator that iterates over the list skipping values until it finds one that meets the filter criterion:
>>> seq = [1, 2, 4, 3, 5, 6, 7]
>>> list(itertools.dropwhile(lambda x: x != 3, seq))
[3, 5, 6, 7]
As far as searching backwards goes, this will only work if the sequence you're using is actually a sequence (like a list, i.e. something that has an end and can be navigated backwards) and not just any iterable (e.g. a generator that returns the next prime number). To do this, use the reversed function, e.g.:
(item for item in reversed(seq) if item >= 5)
One simpler way (albeit slower) would be to use filter to select the keys before/after that date. filter has to process every element in the list, whereas slicing doesn't need to.
You can do
def on_or_after(date):
    from itertools import dropwhile
    sorted_items = sorted(date_dictionary.iteritems())
    def before_date(pair):
        return pair[0] < date
    on_or_after_date = dropwhile(before_date, sorted_items)
    return on_or_after_date
which I think is about as efficient as it's going to get if you're just doing one such lookup on each sorted collection. on_or_after_date will iterate (date, value) pairs.
Another option would be to build a dictionary as a separate index into the sorted list:
sorted_items = sorted(date_dictionary.iteritems())
date_index = dict((key, i) for i, (key, value) in enumerate(sorted_items))
and then get the items on or after a date with
def on_or_after(date):
    return sorted_items[date_index[date]:]
This second approach will be faster if you're going to be doing a lot of lookups on the same series of sorted dates (which it sounds like you are).
If you want really speedy slicing of the sorted dates, you might see some improvement by storing it in a tuple instead of a list. I could be wrong about that though.
Note: the above code is untested; let me know if it doesn't work and you can't sort out why.
First off, this question isn't related to dicts. You're operating on a sorted list. You're using the results on a dict, but that's not relevant to the question.
You want the bisect module, which implements binary searching. Starting from your code:
import bisect
mydict = {
    "2001-01-01": "data1",
    "2005-01-02": "data2",
    "2002-01-01": "data3",
    "2004-01-02": "data4",
}
# ['2001-01-01', '2002-01-01', '2004-01-02', '2005-01-02']:
sorted_dates = sorted(mydict)
# Iterates over 2002-01-01, 2004-01-02 and 2005-01-02:
offset = bisect.bisect_left(sorted_dates, "2002-01-01")
for item in sorted_dates[offset:]:
    print item