I was asked this in an interview today, and am starting to believe it is not solvable.
Given a sorted array of size n, select k elements in the array, and reshuffle them back into the array, resulting in a new "nk-sorted" array.
Find the k (or fewer) elements that have moved in that new array.
Here is (Python) code that creates such arrays, but I don't care about language for this.
import numpy as np

def __generate_unsorted_array(size, is_integer=False, max_int_value=100000):
    return np.random.randint(max_int_value, size=size) if is_integer else np.random.rand(size)

def generate_nk_unsorted_array(n, k, is_integer=False, max_int_value=100000):
    assert k <= n
    unsorted_n_array = __generate_unsorted_array(n - k, is_integer, max_int_value=max_int_value)
    sorted_n_array = sorted(unsorted_n_array)
    random_k_array = __generate_unsorted_array(k, is_integer, max_int_value=max_int_value)
    insertion_inds = np.random.choice(n - k + 1, k, replace=True)  # can put two unsorted next to each other
    nk_unsorted_array = np.insert(sorted_n_array, insertion_inds, random_k_array)
    return list(nk_unsorted_array)
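For illustration, a quick (hypothetical) call; the output varies per run since the data is random:

arr = generate_nk_unsorted_array(10, 2, is_integer=True)
print(arr)  # mostly ascending, with at most 2 elements out of place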
Is this doable under the complexity constraint?
This is only part of the question. The whole question required sorting the "nk-sorted" array in O(n + k log k).
Note: This is a conceptual solution. It is coded in Python, but because of the way Python implements lists, it does not actually run in the required complexity. See soyuzzzz's answer for an actual Python solution that meets the complexity requirement.
Accepted @soyuzzzz's answer over this one.
Original answer (works, but the complexity is only correct assuming a linked-list implementation of Python's List, which is not the case):
This sorts an nk-unsorted array in O(n + k log k), assuming the array should be ascending.
Find elements which are not sorted by traversing the array.
If such an element is found (it is larger than the following one), then either it or the following one is out of order (or both).
Keep both of them aside, and remove them from the array.
Continue traversing the newly obtained array (after removal), from the index which comes before the found element.
This puts aside at most 2k elements in O(n) time.
Sort the (at most) 2k elements, O(k log k).
Merge two sorted lists with n total elements, O(n).
Total: O(n + k log k).
Code:
def merge_sorted_lists(la, lb):
    if la is None or la == []:
        return lb
    if lb is None or lb == []:
        return la

    a_ind = b_ind = 0
    a_len = len(la)
    b_len = len(lb)
    merged = []
    while a_ind < a_len and b_ind < b_len:
        a_value = la[a_ind]
        b_value = lb[b_ind]
        if a_value < b_value:
            merged.append(a_value)
            a_ind += 1
        else:
            merged.append(b_value)
            b_ind += 1

    # get the leftovers into merged
    while a_ind < a_len:
        merged.append(la[a_ind])
        a_ind += 1
    while b_ind < b_len:
        merged.append(lb[b_ind])
        b_ind += 1
    return merged
and
def sort_nk_unsorted_list(nk_unsorted_list):
    working_copy = nk_unsorted_list.copy()  # just for ease of testing
    requires_resorting = []

    current_list_length = len(working_copy)
    i = 0
    while i < current_list_length - 1 and 1 < current_list_length:
        if i == -1:
            i = 0
        first = working_copy[i]
        second = working_copy[i + 1]
        if second < first:
            requires_resorting.append(first)
            requires_resorting.append(second)
            del working_copy[i + 1]
            del working_copy[i]
            i -= 2
            current_list_length -= 2
        i += 1

    sorted_2k_elements = sorted(requires_resorting)
    sorted_nk_list = merge_sorted_lists(sorted_2k_elements, working_copy)
    return sorted_nk_list
Even though @Gulzar's solution is correct, it doesn't actually give us O(n + k * log k).
The problem is in the sort_nk_unsorted_list function. Unfortunately, deleting an arbitrary item from a Python list is not constant time; it's actually O(n). That gives the overall algorithm a complexity of O(n + nk + k * log k).
What we can do to address this is use a different data structure. If you use a doubly-linked list, removing an item (given a reference to its node) is O(1). Unfortunately, Python does not come with one by default.
Here's my solution that achieves O(n + k * log k).
The entry-point function to solve the problem:
def sort(my_list):
    in_order, out_of_order = separate_in_order_from_out_of_order(my_list)
    out_of_order.sort()
    return merge(in_order, out_of_order)
The function that separates the in-order elements from the out-of-order elements:
def separate_in_order_from_out_of_order(my_list):
    list_dll = DoublyLinkedList.from_list(my_list)
    out_of_order = []
    current = list_dll.head
    while current.next is not None:
        if current.value > current.next.value:
            out_of_order.append(current.value)
            out_of_order.append(current.next.value)
            previous = current.prev
            current.next.remove()
            current.remove()
            current = previous
        else:
            current = current.next
    in_order = list_dll.to_list()
    return in_order, out_of_order
The function to merge the two separated lists:
def merge(first, second):
    """
    Merges two [sorted] lists into a sorted list.
    Runtime complexity: O(n)
    Space complexity: O(n)
    """
    i, j = 0, 0
    result = []
    while i < len(first) and j < len(second):
        if first[i] < second[j]:
            result.append(first[i])
            i += 1
        else:
            result.append(second[j])
            j += 1
    result.extend(first[i:])
    result.extend(second[j:])
    return result
And last, this is the DoublyLinkedList implementation (I used a sentinel node to make things easier):
import math

class DoublyLinkedNode:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.prev = None

    def remove(self):
        if self.prev:
            self.prev.next = self.next
        if self.next:
            self.next.prev = self.prev

class DoublyLinkedList:
    def __init__(self, head):
        self.head = head

    @staticmethod
    def from_list(lst):
        sentinel = DoublyLinkedNode(-math.inf)
        previous = sentinel
        for item in lst:
            node = DoublyLinkedNode(item)
            node.prev = previous
            previous.next = node
            previous = node
        return DoublyLinkedList(sentinel)

    def to_list(self):
        result = []
        current = self.head.next
        while current is not None:
            result.append(current.value)
            current = current.next
        return result
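A quick usage sketch of the list (my addition, assuming the classes above), showing the O(1) unlink:

dll = DoublyLinkedList.from_list([1, 3, 2])
node = dll.head.next.next   # the node holding 3 (head is the sentinel)
node.remove()               # unlinks in O(1), re-linking its neighbors
print(dll.to_list())        # [1, 2]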
And these are the unit tests I used to validate the code:
import unittest

class TestSort(unittest.TestCase):
    def test_sort(self):
        test_cases = [
            # (input, expected result)
            ([1, 2, 3, 4, 10, 5, 6], [1, 2, 3, 4, 5, 6, 10]),
            ([1, 2, 5, 4, 10, 6, 0], [0, 1, 2, 4, 5, 6, 10]),
            ([1], [1]),
            ([1, 3, 2], [1, 2, 3]),
            ([], []),
        ]
        for test_input, expected in test_cases:
            result = sort(test_input)
            self.assertEqual(expected, result)

if __name__ == "__main__":
    unittest.main()
I would like to sample a 26-dimensional space with, say, 10 points in every direction. This means there are 10**26 samples in total, but I'll discard more than 99.9999...% of them. Using Python, this immediately leads to memory errors.
A first naive approach is to use nested loops:
p = list(range(10))
for p1 in p:
    for p2 in p:
        ...
However, CPython has a built-in limit on the number of statically nested blocks, so at most 20 nested loops are allowed.
A better approach would be to use the numpy.indices command:
import numpy as np
dimensions = (10,)*26
indices = np.indices(dimensions)
This fails with an "array too big" message because Numpy can't fit all 10**26 indices in memory. Understandable.
My final approach was to use an iterator, hoping this didn't need more memory:
import numpy as np
dimensions = (10,)*26
for index in np.ndindex(*dimensions):
    pass  # do something with index
However, this ALSO fails with an "array too big" message, since under the hood Numpy still tries to create a dense array.
Does anybody else have a better approach?
Thanks!
Tom
EDIT: The "array too big" message is probably because 10**26 is larger than the maximum value an Int64 can store. If you could tell Numpy to store the size as an Int128, that might circumvent the ValueError at least. It'll still require almost 20GB to store all the indices as Int64 though ...
So far, this is the solution that I've found:
class IndicesGenerator:
    def __init__(self, nbDimensions, nbSamplesPerDimension):
        self.nbDimensions = nbDimensions
        self.nbSamplesPerDimension = nbSamplesPerDimension

    def getNbDimensions(self):
        return self.nbDimensions

    def getNbSamplesPerDimension(self):
        return self.nbSamplesPerDimension

    def getIndices(self):
        d = self.getNbDimensions()
        N = self.getNbSamplesPerDimension()

        # create indices
        indices = []
        prevIndex = None
        for i in range(d):
            newIndex = Index(maxValue=N-1, prev=prevIndex)
            indices.append(newIndex)
            prevIndex = newIndex
        lastIndex = indices[-1]

        while True:
            try:
                yield list(map(lambda index: index.getValue(), indices))
                lastIndex.increment()
            except RuntimeError:
                break

class Index:
    def __init__(self, maxValue, prev=None):
        assert prev is None or isinstance(prev, Index)
        assert isinstance(maxValue, int)
        self.prev = prev
        self.value = 0
        self.maxValue = maxValue

    def getPrevious(self):
        return self.prev

    def getValue(self):
        return self.value

    def setValue(self, value):
        assert isinstance(value, int)
        self.value = value

    def getMaximumValue(self):
        return self.maxValue

    def increment(self):
        if self.getValue() == self.getMaximumValue():
            # increment previous and set the current one to zero
            if self.getPrevious() is None:
                # the end is reached, so raise an error
                raise RuntimeError
            else:
                self.setValue(0)
                self.getPrevious().increment()
        else:
            self.setValue(self.getValue() + 1)

if __name__ == '__main__':
    import time

    nbIndices = 0
    d = 3
    N = 5
    start = time.time()
    for indices in IndicesGenerator(nbDimensions=d, nbSamplesPerDimension=N).getIndices():
        # print(indices)
        nbIndices += 1
    assert nbIndices == N**d
    end = time.time()
    print("Nb indices generated: ", nbIndices)
    print("Computation time: ", round(end - start, 2), "s.")
It's not fast for large dimensions but at least it works without memory errors.
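For comparison (not part of the original solution): the standard library's itertools.product yields the same index tuples lazily, also in constant memory:

import itertools

d, N = 3, 5
count = 0
for index in itertools.product(range(N), repeat=d):
    count += 1  # index runs from (0, 0, 0) to (4, 4, 4)
assert count == N**d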
I am trying to find the determinant of a two-dimensional matrix represented as a nested list. But the code keeps calling the getMinor() function infinitely and keeps deleting from the same list, which should not happen because I am creating a new list every time. Below is the code. All of the functions are defined in a class named Matrix.
def __init__(self):
    self.matrix_list = []
    self.no_of_row = 0
    self.no_of_col = 0

def getMinor(self, matrix, j):
    del matrix[0]
    for i in range(len(matrix)):
        del matrix[i][j]
    m = Matrix()
    m.matrix_list = matrix[:]
    m.no_of_row = len(m.matrix_list)
    # print(m.no_of_row)
    print(m.matrix_list)
    m.no_of_col = len(m.matrix_list[0])
    return m.detMatrix()

def detMatrix(self):
    if self.no_of_row == 2 and self.no_of_col == 2:
        return self.matrix_list[0][0] * self.matrix_list[1][1] - self.matrix_list[0][1] * self.matrix_list[1][0]
    else:
        matrix = self.matrix_list[:]
        det = 0
        for i in range(self.no_of_col):
            det += ((-1)**i) * self.matrix_list[0][i] * self.getMinor(matrix, i)
        return det
You have two problems. One is alluded to by user2357112 who unfortunately didn't bother to explain. When you use the expression x[:] you get a shallow copy of the list x. Often there is no practical difference between deep and shallow copies; for example if x contains numbers or strings. But in your case the elements of x are lists. Each element of the new list, x[:], will be the same sub-list that was in the original x - not a copy. When you delete one element of those nested lists (del matrix[i][j]), you are therefore deleting some of your original data.
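A minimal demonstration of that aliasing (my example, not from the original post):

x = [[1, 2], [3, 4]]
y = x[:]       # shallow copy: new outer list, same inner lists
del y[0][0]    # mutates the inner list shared with x
print(x)       # [[2], [3, 4]] -- the "copy" changed the original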
The second problem is that you aren't handling the recursion properly. You create a new variable, matrix, in the function detMatrix. Even if you make a deep copy here, that won't fix the problem. You pass matrix to getMinor, which deletes some data from it. Now in the next step through your for loop, you have messed up the data. You need to make a deep copy inside the function getMinor.
Here is a program that runs, at least. I didn't check your algebra :-)
I will also add that it's very inefficient. The idea of making a copy and then deleting pieces from the copy doesn't make much sense. I didn't address this.
import copy

class Matrix:
    def __init__(self):
        self.matrix_list = []
        self.no_of_row = 0
        self.no_of_col = 0

    def getMinor(self, matrix_list, j):
        print("Entry:", matrix_list)
        matrix = copy.deepcopy(matrix_list)
        del matrix[0]
        for i in range(len(matrix)):
            del matrix[i][j]
        print("After deletions", matrix_list)
        m = Matrix()
        m.matrix_list = matrix[:]
        m.no_of_row = len(m.matrix_list)
        m.no_of_col = len(m.matrix_list[0])
        x = m.detMatrix()
        print(m.matrix_list, m.no_of_row, m.no_of_col)
        return x

    def detMatrix(self):
        if self.no_of_row == 2 and self.no_of_col == 2:
            return self.matrix_list[0][0] * self.matrix_list[1][1] - self.matrix_list[0][1] * self.matrix_list[1][0]
        else:
            det = 0
            for i in range(self.no_of_col):
                det += ((-1)**i) * self.matrix_list[0][i] * self.getMinor(self.matrix_list, i)
            return det

m = Matrix()
m.matrix_list.append([0.0, 1.0, 2.0, 3.0])
m.matrix_list.append([1.0, 2.0, 3.0, 4.0])
m.matrix_list.append([2.0, 3.0, 4.0, 5.0])
m.matrix_list.append([3.0, 5.0, 7.0, 9.0])
m.no_of_row = 4
m.no_of_col = 4
print(m.detMatrix())
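As a sanity check on that sample matrix (my addition, assuming NumPy is available): the rows are linearly dependent (row 3 equals 2 * row 2 - row 1), so the determinant should come out as 0:

import numpy as np

a = np.array([[0., 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5], [3, 5, 7, 9]])
print(np.linalg.det(a))  # ~0.0, up to floating-point noise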
Let's say I define a record array
>>> y=np.zeros(4,dtype=('a4,int32,float64'))
and then I proceed to fill up the 4 records available. Now I get more data, something like
>>> c=('a',7,'24.5')
and I want to add this record to y. I can't figure out a clean way to do it. The best I have seen is np.concatenate(), but that would require turning c into a record array in and of itself. Is there any simple way to tack my tuple c onto y? This seems like it should be really straightforward and widely documented. Apologies if it is; I haven't been able to find it.
You can use numpy.append(), but you need to convert the new data into a record array as well:
import numpy as np
y = np.zeros(4,dtype=('a4,int32,float64'))
y = np.append(y, np.array([("0",7,24.5)], dtype=y.dtype))
Since an ndarray can't dynamically change its size, you need to copy all the data whenever you append something new. You can create a class that reduces the resize frequency:
import numpy as np

class DynamicRecArray(object):
    def __init__(self, dtype):
        self.dtype = np.dtype(dtype)
        self.length = 0
        self.size = 10
        self._data = np.empty(self.size, dtype=self.dtype)

    def __len__(self):
        return self.length

    def append(self, rec):
        if self.length == self.size:
            self.size = int(1.5 * self.size)
            self._data = np.resize(self._data, self.size)
        self._data[self.length] = rec
        self.length += 1

    def extend(self, recs):
        for rec in recs:
            self.append(rec)

    @property
    def data(self):
        return self._data[:self.length]

y = DynamicRecArray(('a4,int32,float64'))
y.extend([("xyz", 12, 3.2), ("abc", 100, 0.2)])
y.append(("123", 1000, 0))
print y.data
for i in xrange(100):
    y.append((str(i), i, i + 0.1))
This is because concatenating numpy arrays is typically avoided as it requires reallocation of contiguous memory space. Size your array with room to spare, and then concatenate in large chunks if needed. This post may be of some help.
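A minimal sketch of that preallocate-then-trim idea (my illustration; the capacity and dtype are arbitrary):

import numpy as np

capacity = 1000                                     # room to spare
buf = np.zeros(capacity, dtype='a4,int32,float64')
n = 0
for rec in [("a", 7, 24.5), ("b", 8, 1.0)]:
    buf[n] = rec                                    # fill in place, no reallocation
    n += 1
data = buf[:n]                                      # view of the filled portion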
This is a follow-up to a similar question which asked the best way to write

for item in somelist:
    if determine(item):
        code_to_remove_item
and it seems the consensus was on something like
somelist[:] = [x for x in somelist if not determine(x)]
However, I think if you are only removing a few items, most of the items are just being copied into the new list, and perhaps that is slow. In an answer to another related question, someone suggests:

for item in reversed(somelist):
    if determine(item):
        somelist.remove(item)

However, here list.remove will search for the item, which is O(N) in the length of the list. Maybe we are limited in that the list is represented as an array rather than a linked list, so removing an item needs to shift everything after it. However, it is suggested here that collections.deque is implemented as a doubly linked list. It should then be possible to remove in O(1) while iterating. How would we actually accomplish this?
Update:
I did some time testing as well, with the following code:
import timeit

setup = """
import random
random.seed(1)
b = [(random.random(), random.random()) for i in xrange(1000)]
c = []
def tokeep(x):
    return (x[1] > .45) and (x[1] < .5)
"""

listcomp = """
c[:] = [x for x in b if tokeep(x)]
"""

filt = """
c = filter(tokeep, b)
"""

print "list comp = ", timeit.timeit(listcomp, setup, number=10000)
print "filtering = ", timeit.timeit(filt, setup, number=10000)
and got:
list comp = 4.01255393028
filtering = 3.59962391853
The list comprehension is the asymptotically optimal solution:
somelist = [x for x in somelist if not determine(x)]
It only makes one pass over the list, so it runs in O(n) time. Since you need to call determine() on each object, any algorithm will require at least O(n) operations. The list comprehension does have to do some copying, but it only copies references to the objects, not the objects themselves.
Removing items from a list in Python is O(n), so anything with a remove, pop, or del inside the loop will be O(n**2).
Also, in CPython list comprehensions are faster than for loops.
If you need to remove items in O(1), you can use a hash map (in Python, a dict or set).
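For example (my sketch, not from the original answer): if the elements are hashable and order doesn't matter, a set supports average O(1) removal:

items = {1, 2, 3, 4}
items.discard(3)  # average O(1); no KeyError if the element is absent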
Since list.remove is equivalent to del list[list.index(x)], you could do:
for idx, item in enumerate(somelist):
    if determine(item):
        del somelist[idx]
But: you should not modify the list while iterating over it. It will bite you, sooner or later. Use filter or list comprehension first, and optimise later.
A deque is optimized for head and tail removal, not for arbitrary removal in the middle. The removal itself is fast, but you still have to traverse the list to the removal point. If you're iterating through the entire length, then the only difference between filtering a deque and filtering a list (using filter or a comprehension) is the overhead of copying, which at worst is a constant multiple; it's still an O(n) operation. Also, note that the objects in the list aren't being copied -- just the references to them. So it's not that much overhead.
It's possible that you could avoid copying like so, but I have no particular reason to believe this is faster than a straightforward list comprehension -- it's probably not:
write_i = 0
for read_i in range(len(L)):
    L[write_i] = L[read_i]
    if L[read_i] not in ['a', 'c']:
        write_i += 1
del L[write_i:]
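If you do want to stay with a deque, as the question suggests, one in-place O(n) pattern (a sketch of mine, not from the original answers) is to cycle every element through the ends, which is where a deque is fast:

from collections import deque

d = deque(['a', 'b', 'c', 'd'])
for _ in range(len(d)):
    x = d.popleft()            # O(1) at the head
    if x not in ['a', 'c']:    # same keep-condition as the example above
        d.append(x)            # O(1) at the tail
# d is now deque(['b', 'd'])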
I took a stab at this. My solution is slower, but requires less memory overhead (i.e. doesn't create a new array). It might even be faster in some circumstances!
This code has been edited since its first posting
I had problems with timeit; I might be doing this wrong.
import timeit

setup = """
import random
random.seed(1)
global b
setup_b = [(random.random(), random.random()) for i in xrange(1000)]
c = []

def tokeep(x):
    return (x[1] > .45) and (x[1] < .5)

# define and call to turn into psyco bytecode (if using psyco)
b = setup_b[:]
def listcomp():
    c[:] = [x for x in b if tokeep(x)]
listcomp()

b = setup_b[:]
def filt():
    c = filter(tokeep, b)
filt()

b = setup_b[:]
def forfilt():
    marked = (i for i, x in enumerate(b) if tokeep(x))
    shift = 0
    for n in marked:
        del b[n - shift]
        shift += 1
forfilt()

b = setup_b[:]
def forfiltCheating():
    marked = (i for i, x in enumerate(b) if (x[1] > .45) and (x[1] < .5))
    shift = 0
    for n in marked:
        del b[n - shift]
        shift += 1
forfiltCheating()
"""

listcomp = """
b = setup_b[:]
listcomp()
"""

filt = """
b = setup_b[:]
filt()
"""

forfilt = """
b = setup_b[:]
forfilt()
"""

forfiltCheating = '''
b = setup_b[:]
forfiltCheating()
'''

psycosetup = '''
import psyco
psyco.full()
'''

print "list comp = ", timeit.timeit(listcomp, setup, number=10000)
print "filtering = ", timeit.timeit(filt, setup, number=10000)
print 'forfilter = ', timeit.timeit(forfilt, setup, number=10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, setup, number=10000)

print '\nnow with psyco \n'

print "list comp = ", timeit.timeit(listcomp, psycosetup + setup, number=10000)
print "filtering = ", timeit.timeit(filt, psycosetup + setup, number=10000)
print 'forfilter = ', timeit.timeit(forfilt, psycosetup + setup, number=10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, psycosetup + setup, number=10000)
And here are the results
list comp = 6.56407690048
filtering = 5.64738512039
forfilter = 7.31555104256
forfiltCheating = 4.8994679451
now with psyco
list comp = 8.0485959053
filtering = 7.79016900063
forfilter = 9.00477004051
forfiltCheating = 4.90830993652
I must be doing something wrong with psyco, because it is actually running slower.
Elements are not copied by list comprehension
This took me a while to figure out. See the example code below to experiment with different approaches yourself.
code
You can specify how long a list element takes to copy and how long it takes to evaluate. The time to copy is irrelevant for list comprehension, as it turns out.
import time
import timeit
import numpy as np

def ObjectFactory(time_eval, time_copy):
    """
    Creates a class

    Parameters
    ----------
    time_eval : float
        time to evaluate (True or False, i.e. keep in list or not) an object
    time_copy : float
        time to (shallow-) copy an object. Used by list comprehension.

    Returns
    -------
    New class with defined copy-evaluate performance
    """
    class Object:
        def __init__(self, id_, keep):
            self.id_ = id_
            self._keep = keep

        def __repr__(self):
            return f"Object({self.id_}, {self.keep})"

        @property
        def keep(self):
            time.sleep(time_eval)
            return self._keep

        def __copy__(self):  # list comprehension does not copy the object
            time.sleep(time_copy)
            return self.__class__(self.id_, self._keep)

    return Object

def remove_items_from_list_list_comprehension(lst):
    return [el for el in lst if el.keep]

def remove_items_from_list_new_list(lst):
    new_list = []
    for el in lst:
        if el.keep:
            new_list += [el]
    return new_list

def remove_items_from_list_new_list_by_ind(lst):
    new_list_inds = []
    for ee in range(len(lst)):
        if lst[ee].keep:
            new_list_inds += [ee]
    return [lst[ee] for ee in new_list_inds]

def remove_items_from_list_del_elements(lst):
    """WARNING: Modifies lst"""
    new_list_inds = []
    for ee in range(len(lst)):
        if lst[ee].keep:
            new_list_inds += [ee]
    for ind in new_list_inds[::-1]:
        if not lst[ind].keep:
            del lst[ind]

if __name__ == "__main__":
    ClassSlowCopy = ObjectFactory(time_eval=0, time_copy=0.1)
    ClassSlowEval = ObjectFactory(time_eval=1e-8, time_copy=0)

    keep_ratio = .8
    n_runs_timeit = int(1e2)
    n_elements_list = int(1e2)

    lsts_to_tests = dict(
        list_slow_copy_remove_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_copy_keep_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_eval_remove_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_eval_keep_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
    )

    for lbl, lst in lsts_to_tests.items():
        print()
        for fct in [
            remove_items_from_list_list_comprehension,
            remove_items_from_list_new_list,
            remove_items_from_list_new_list_by_ind,
            remove_items_from_list_del_elements,
        ]:
            lst_loc = lst.copy()
            t = timeit.timeit(lambda: fct(lst_loc), number=n_runs_timeit)
            print(f"{fct.__name__}, {lbl}: {t=}")
output
remove_items_from_list_list_comprehension, list_slow_copy_remove_many: t=0.0064229519994114526
remove_items_from_list_new_list, list_slow_copy_remove_many: t=0.006507338999654166
remove_items_from_list_new_list_by_ind, list_slow_copy_remove_many: t=0.006562008995388169
remove_items_from_list_del_elements, list_slow_copy_remove_many: t=0.0076057760015828535
remove_items_from_list_list_comprehension, list_slow_copy_keep_many: t=0.006243691001145635
remove_items_from_list_new_list, list_slow_copy_keep_many: t=0.007145451003452763
remove_items_from_list_new_list_by_ind, list_slow_copy_keep_many: t=0.007032064997474663
remove_items_from_list_del_elements, list_slow_copy_keep_many: t=0.007690364996960852
remove_items_from_list_list_comprehension, list_slow_eval_remove_many: t=1.2495998149970546
remove_items_from_list_new_list, list_slow_eval_remove_many: t=1.1657221479981672
remove_items_from_list_new_list_by_ind, list_slow_eval_remove_many: t=1.2621939050004585
remove_items_from_list_del_elements, list_slow_eval_remove_many: t=1.4632593330024974
remove_items_from_list_list_comprehension, list_slow_eval_keep_many: t=1.1344162709938246
remove_items_from_list_new_list, list_slow_eval_keep_many: t=1.1323430630000075
remove_items_from_list_new_list_by_ind, list_slow_eval_keep_many: t=1.1354237199993804
remove_items_from_list_del_elements, list_slow_eval_keep_many: t=1.3084568729973398
import collections

list1 = collections.deque(list1)
for i in list2:
    try:
        list1.remove(i)
    except ValueError:
        pass

Instead of checking whether the element is there, this uses try/except. I guess this is faster.