Fast membership in slices of lists - python

I have a very large list called 'data' and I need to answer queries equivalent to
if (x in data[a:b]):
for different values of a, b and x.
Is it possible to preprocess data to make these queries fast?

Idea
You can build a dict that maps each element to the sorted list of positions where it occurs.
To answer a query, binary-search that list for the first position greater than or equal to a, then check that such a position exists and is less than b.
Pseudocode
Preprocessing:
import bisect
from collections import defaultdict

byvalue = defaultdict(list)
for i, x in enumerate(data):
    byvalue[x].append(i)  # positions are appended in increasing order, so each list is already sorted
Query:
def has_index_in_slice(indices, a, b):
    # find the first stored position >= a and check that it falls before b
    r = bisect.bisect_left(indices, a)
    return r < len(indices) and indices[r] < b

def check(byvalue, x, a, b):
    indices = byvalue.get(x)
    if not indices:
        return False
    return has_index_in_slice(indices, a, b)
Complexity is O(log N) per query, assuming O(1) indexing for lists and O(1) lookups for dicts.
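For illustration, a small worked example with made-up data:
data = [5, 3, 7, 3, 9, 5]
byvalue = defaultdict(list)
for i, x in enumerate(data):
    byvalue[x].append(i)

check(byvalue, 3, 2, 5)  # True: 3 occurs at index 3, which lies in [2, 5)
check(byvalue, 5, 1, 4)  # False: 5 occurs at indices 0 and 5, neither in [1, 4)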

Yes, you could preprocess those slices into sets, thereby making membership lookup O(1) instead of O(n):
check = set(data[a:b])
if x in check:
    # do something
if y in check:
    # do something else

Put the list in a database and take advantage of the built-in indexing, optimizations and caching. For instance, from the PostgreSQL manual:
Once an index is created, no further intervention is required: the
system will update the index when the table is modified, and it will
use the index in queries when it thinks doing so would be more
efficient than a sequential table scan.
But you could also use sqlite for simplicity (and availability in Python's standard library). From Python's documentation, regarding indexing:
A Row instance serves as a highly optimized row_factory for Connection
objects. It tries to mimic a tuple in most of its features.
It supports mapping access by column name and index, iteration,
representation, equality testing and len().
And elsewhere on that page:
Row provides both index-based and case-insensitive name-based access
to columns with almost no memory overhead. It will probably be better
than your own custom dictionary-based approach or even a db_row based
solution.
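As a rough sketch of what the sqlite3 route could look like for this exact query (an in-memory table with a position column; it assumes integer values, and the table and index names are made up):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (pos INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)", enumerate(data))
# composite index so the (value, position-range) lookup can be answered from the index alone
conn.execute("CREATE INDEX idx_value_pos ON items (value, pos)")
conn.commit()

def in_slice(x, a, b):
    # equivalent to: x in data[a:b]
    row = conn.execute(
        "SELECT 1 FROM items WHERE value = ? AND pos >= ? AND pos < ? LIMIT 1",
        (x, a, b),
    ).fetchone()
    return row is not None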

Related

How to filter a collection of objects by field value?

How do I organize and filter a collection of objects by a field value in Python? I need to filter both by equality with an exact value and by being less than a value.
And how do I do it efficiently? If I store my objects in a list, I have to iterate over the whole list, which may hold hundreds of thousands of objects.
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    salary: float
    is_boss: bool

# if storing objects in a list...
collection = [Person("Jack", 50000, 0), ..., Person("Jane", 120000, 1)]
# filtering in O(n), sloooooow
target = 100000
filtered_collection = [x for x in collection if x.salary < target]
PS: Actually my use case is to group by one field (i.e. is_boss) and filter by another (i.e. salary). How do I do that? Should I use itertools.groupby over sorted lists and make my objects comparable?
If you maintain your list in sorted order (which ideally means few insertions or removals, because mid-list insertion/removal is itself O(n)), you can find the set of Persons below a given salary with the bisect module.
from bisect import bisect
from operator import attrgetter
# if storing objects in a list...
collection = [Person("Jack", 50000, 0), ..., Person("Jane", 120000, 1)]
collection.sort(key=attrgetter('salary')) # O(n log n) initial sort
# filtering searches in O(log n):
target = 100000
filtered_collection = collection[:bisect(collection, target, key=attrgetter('salary'))]
Note: The key argument to the various bisect module functions is only supported as of 3.10. In prior versions, you'd need to define the rich comparison operators for Person in terms of salary and search for a faked out Person object, or maintain ugly separate sorted lists, one of salary alone, and a parallel list of the Person objects.
For adding individual elements to the collection, you could use bisect's insort function. Or you could just add a bunch of items to the end of the list in bulk and resort it on the same key as before (Python's sorting algorithm, TimSort, gets near O(n) performance when the collection is mostly in order already, so the cost is not as high as you might think).
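For instance, a quick sketch of adding one made-up Person while keeping the list sorted, reusing the key from above (the key argument to insort likewise requires Python 3.10+):
from bisect import insort

insort(collection, Person("Joe", 75000, 0), key=attrgetter('salary'))  # O(log n) search, O(n) shift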
I'll note that in practice, this sort of scenario (massive data that can be arbitrarily ordered by multiple fields) usually calls for a database; you might consider using sqlite3 (eventually switching to a more production-grade database like MySQL or PostGres if needed), which, with appropriate indexes defined, would let you do O(log n) SELECTs on any indexed field; you could convert to Person objects on extraction for the data you actually need to work with. The B-trees that true DBMS solutions provide get you O(log n) effort for inserts, deletes and selects on the index fields, where Python built-in collection types make you choose; only one of insertion/deletion or searching can be truly O(log n), while the other is O(n).
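For a rough idea of what that could look like with sqlite3 (a sketch only; the schema and index names are invented, and it assumes a concrete collection of Person objects):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, salary REAL, is_boss INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(p.name, p.salary, p.is_boss) for p in collection],
)
conn.execute("CREATE INDEX idx_salary ON person (salary)")  # lets the WHERE clause use an index scan
conn.commit()

# filter by salary and group by is_boss in a single indexed query
rows = conn.execute(
    "SELECT is_boss, name, salary FROM person WHERE salary < ? ORDER BY is_boss",
    (100000,),
).fetchall()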
Lists have a sort method - all you have to do is create a key function that extracts the value to sort by - let me show you:
class Foo:
    def __init__(self, bar):
        self.bar = bar

fooArray = [Foo(10), Foo(8), Foo(9)]

def sortFoo(foo):
    return foo.bar

fooArray.sort(key=sortFoo)

Is there a better way than a while loop to perform this function?

I was attempting some python exercises and I hit the 5s timeout on one of the tests. The function is pre-populated with the parameters and I am tasked with writing code that is fast enough to run within the max timeframe of 5s.
There are N dishes in a row on a kaiten belt, with the ith dish being of type Di. Some dishes may be of the same type as one another. The N dishes will arrive in front of you, one after another in order, and for each one you'll eat it as long as it isn't the same type as any of the previous K dishes you've eaten. You eat very fast, so you can consume a dish before the next one gets to you. Any dishes you choose not to eat as they pass will be eaten by others.
Determine how many dishes you'll end up eating.
Issue
The code "works" but is not fast enough.
Code
The idea here is to add the D[i] entry if it is not in the pastDishes list (which can be of size K).
from typing import List
# Write any import statements here
def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
    # Write your code here
    numDishes = 0
    pastDishes = []
    i = 0
    while i < N:
        if D[i] not in pastDishes:
            numDishes += 1
            pastDishes.append(D[i])
            if len(pastDishes) > K:
                pastDishes.pop(0)
        i += 1
    return numDishes
Is there a more effective way?
After much trial and error, I have finally found a solution that is fast enough to pass the final case in the puzzle you are working on. My previous code was very neat and quick; however, I have finally found a module with a tool that makes this much faster. It's from collections, just as deque is, but it is called Counter.
This was my original code:
def getMaximumEatenDishCount(N: int, D: list, K: int) -> int:
    numDishes = lastMod = 0
    pastDishes = [0] * K
    for Dval in D:
        if Dval in pastDishes: continue
        pastDishes[lastMod] = Dval
        numDishes, lastMod = numDishes + 1, (lastMod + 1) % K
    return numDishes
I then implemented Counter like so:
from typing import List
# Write any import statements here
from collections import Counter
def getMaximumEatenDishCount(N: int, D: 'list[int]', K: int) -> int:
    eatCount = lastMod = 0
    pastDishes = [0] * K
    eatenCounts = Counter({0: K})
    for Dval in D:
        if Dval in eatenCounts: continue
        eatCount += 1
        eatenCounts[Dval] += 1
        val = pastDishes[lastMod]
        if eatenCounts[val] <= 1: eatenCounts.pop(val)
        else: eatenCounts[val] -= 1
        pastDishes[lastMod] = Dval
        lastMod = (lastMod + 1) % K
    return eatCount
Which ended up working quite well. I'm sure you can make it less clunky, but this should work fine on its own.
Some explanation of what I am doing:
While loops are typically marginally faster than for loops, but since I would need to access the value at an index multiple times, I believe a for loop is actually better in this situation. You can see I also initialised the list to the maximum size it needs to be and am overwriting the values instead of popping and appending, which saves a lot of time. Additionally, as pointed out by @outis, another small improvement was made by using the modulo operator together with the index variable, which removes the need for an additional if statement.
The Counter is essentially a special dict object that holds a hashable as the key and an int as the value. I use the fact that lastMod is an index to what would normally be accessed through list.pop(0) to find the object that needs to be removed or decremented in the Counter.
Note that it is not considered 'pythonic' to assign multiple variables on one line, but I believe it adds a slight performance boost, which is why I have done it. This can be argued, though; see this post.
If anyone else is interested the problem that we were trying to solve, it can be found here: https://www.facebookrecruiting.com/portal/coding_puzzles/?puzzle=958513514962507
Can we use an appropriate data structure? If so:
Data structures
This seems like a job for an ordered set which you have to shrink to a capacity limit of K.
To meet that, whenever it exceeds the limit (len(ordered_set) > K) we have to remove the first n items, where n = len(ordered_set) - K. Ideally the removal performs in O(1).
However, removal from a set happens in an unordered fashion, so we first transform it to a list containing the unique elements in the order of appearance in the original sequence.
From that ordered list we can then remove the first n elements.
For example, the function lru below returns the least-recently-used items of a sequence seq, limited by the capacity k.
To obtain the length we can simply call len() on that LRU return value:
maximumEatenDishCount = len(lru(seq, k))
See also:
Does Python have an ordered set?
Fastest way to get sorted unique list in python?
Using set for uniqueness (up to Python 3.6)
def lru(seq, k):
    return list(set(seq))[:k]
Using dict for uniqueness (since Python 3.6)
Same mechanics as above, but relying on the insertion order of dicts (an implementation detail in CPython 3.6, guaranteed since 3.7):
using OrderedDict explicitly
from collections import OrderedDict

def lru(seq, k):
    return list(OrderedDict.fromkeys(seq).keys())[:k]
using dict factory-method:
def lru(seq, k):
    return list(dict.fromkeys(seq).keys())[:k]
using dict-comprehension:
def lru(seq, k):
    return list({i: 0 for i in seq}.keys())[:k]
See also:
The order of keys in dictionaries
Using ordered dictionary as ordered set
How do you remove duplicates from a list whilst preserving order?
Real Python: OrderedDict vs dict in Python: The Right Tool for the Job
As the problem is an exercise, exact solutions are not included. Instead, strategies are described.
There are at least a couple potential approaches:
Use a data structure that supports fast containment testing (a set in use, if not in name) limited to the K most recently eaten dishes. Fortunately, since dict preserves insertion order in newer Python versions and testing key containment is fast, it will fit the bill. dict requires that keys be hashable, but since the problem uses ints to represent dish types, that requirement is met.
With this approach, the algorithm in the question remains unchanged.
Rather than checking whether the next dish type is any of the last K dishes, check whether the last time the next dish was eaten is within K of the current plate count. If it is, skip the dish. If not, eat the dish (update both the record of when the next dish was last eaten and the current dish count). In terms of data structures, the program will need to keep a record of when any given dish type was last eaten (initialized to -K-1 to ensure that the first time a dish type is encountered it will be eaten; defaultdict can be very useful for this).
With this approach, the algorithm is slightly different. The code ends up being slightly shorter, as there's no shortening of the data structure storing information about the dishes as there is in the original algorithm.
There are two takeaways from the latter approach that might be applied when solving other problems:
More broadly, reframing a problem (such as from "the dish is in the last K dishes eaten" to "the dish was last eaten within K dishes of now") can result in a simpler approach.
Less broadly, sometimes it's more efficient to work with a flipped data structure, swapping keys/indices and values.
Approach & takeaway 2 both remind me of a substring search algorithm (the name escapes me) that uses a table of positions in the needle (the string to search for) of where each character first appears (for characters not in the string, the table has the length of the string); when a mismatch occurs, the algorithm uses the table to align the substring with the mismatching character, then starts checking at the start of the substring. It's not the most efficient string search algorithm, but it's simple and more efficient than the naive algorithm. It's similar to but simpler and less efficient than the skip search algorithm, which uses the positions of every occurrence of each character in the needle.
from typing import List
# Write any import statements here
from collections import deque, Counter
def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
    # Write your code here
    q = deque()
    cnt = 0
    dish_counter = Counter()
    for d in D:
        if dish_counter[d] == 0:
            cnt += 1
            q.append(d)
            dish_counter[d] += 1
            if len(q) == K + 1:
                remove = q.popleft()
                dish_counter[remove] -= 1
    return cnt

What's the most efficient way to perform a multiple match lookup in a python dictionary?

I'm looking to maximally optimize the runtime for this chunk of code:
aDictionary= {"key":["value", "value2", ...
rests = \
    list(map((lambda key: Resp(key=key)),
             [key for key, values in
              aDictionary.items() if (test1 in values or test2 in values)]))
Using Python 3, and willing to throw as much memory at it as possible.
Considering running the two dictionary lookups on separate processes for a speedup (does that make sense?). Any other optimization ideas are welcome.
values can definitely be sorted and turned into a set; it is precomputed and very, very large.
Always len(values) >>>> len(tests), though both are growing over time.
len(tests) grows very slowly, and has new values for each execution.
Currently the values are strings (considering doing a string->integer mapping).
For starters, there is no reason to use map when you are already using a list comprehension, so you can remove that, as well as the outer list call:
rests = [Resp(key=key) for key, values in aDictionary.items()
if (test1 in values or test2 in values)]
A second possible optimization might be to turn each list of values into a set. That would take up time initially, but it would change your lookups (the in checks) from linear time to constant time. You might need to create a separate helper function for that. Something like:
def anyIn(checking, checkingAgainst):
    checkingAgainst = set(checkingAgainst)
    for val in checking:
        if val in checkingAgainst:
            return True
    return False
Then you could change the end of your list comprehension to read
...if anyIn([test1, test2], values)]
But again, this would probably only be worth it if you had more than two values you were checking, or if the list of values in values is very long.
If tests are sufficiently numerous, it will surely pay off to switch to set operations:
tests = set([test1, test2, ...])
resps = map(Resp, (k for k, values in dic.items() if not tests.isdisjoint(values)))
# resps is a lazy iterable, not a list, and it uses a
# generator inside, thus saving the overhead of building
# the inner list.
Turning the dict values into sets would not gain anything, as the conversion would be O(N), with N being the combined size of all values lists, while the disjointness check above only iterates each values list until it encounters one of the tests, with O(1) lookups.
map is possibly more performant compared to a comprehension if you do not have to use lambda, e.g. if key can be used as the first positional argument in Resp's __init__, but certainly not with the lambda! (Python List Comprehension Vs. Map). Otherwise, a generator or comprehension will be better:
resps = (Resp(key=k) for k, values in dic.items() if not tests.isdisjoint(values))
#resps = [Resp(key=k) for k, values in dic.items() if not tests.isdisjoint(values)]

Python Dict vs List for adding unique element only

In order to achieve an iterable of unique elements, is [2] acceptable?
# [1]
if element not in list:
    list.append(element)
# [2]
dict[element] = None # value doesn't matter
Use set as your data structure.
A list is not good performance-wise: checking whether an element is in a list takes linear time, so the longer the list, the slower it gets.
A set has constant lookup time. A dictionary does too, but you don't need key-value pairs, so it's more elegant to do:
s = set()
s.add(element)
than
s = {}
s[element] = None
Plus you get all the nice set operations, like union, intersection, etc. See the documentation.
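For example, a common pattern for collecting unique elements in order, using a set for the membership checks (a sketch; items is just a placeholder name):
items = ["a", "b", "a", "c", "b"]
seen = set()
unique = []
for element in items:
    if element not in seen:  # O(1) membership test
        seen.add(element)
        unique.append(element)
# unique == ["a", "b", "c"]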

Optimised Python dictionary / negative index storage

Prompted by this question's comments (I can see now that this is irrelevant), I am aware that using dictionaries for data that needs to be queried/accessed regularly is not good, speed-wise.
I have a situation of something like this:
someDict = {}
someDict[(-2, -2)] = something
someDict[(3, -10)] = something_else
I am storing keys of coordinates to objects that act as arrays of tiles in a game. These are going to be negative at some point, so I can't use a list or some kind of sparse array (I think that's the term?).
Can I either:
Speed up dictionary lookups, so this would not be an issue
Find some kind of container that will support sparse, negative indices?
I would use a list, but then the querying would go from O(log n) to O(n) to find the area at (x, y). (I think my timings are off here too).
Python dictionaries are very, very fast, and using a tuple of integers as the key is not going to be a problem. However, your use case suggests that you sometimes need to do a single-coordinate check, and doing that by traversing the whole dict is of course slow.
Instead of doing a linear search you can however speed up the data structure for the access you need using three dictionaries:
class Grid(object):
    def __init__(self):
        self.data = {}  # (i, j) -> data
        self.cols = {}  # i -> set of j
        self.rows = {}  # j -> set of i

    def __getitem__(self, ij):
        return self.data[ij]

    def __setitem__(self, ij, value):
        i, j = ij
        self.data[ij] = value
        try:
            self.cols[i].add(j)
        except KeyError:
            self.cols[i] = set([j])
        try:
            self.rows[j].add(i)
        except KeyError:
            self.rows[j] = set([i])

    def getRow(self, i):
        return [(i, j, self.data[(i, j)])
                for j in self.cols.get(i, [])]

    def getCol(self, j):
        return [(i, j, self.data[(i, j)])
                for i in self.rows.get(j, [])]
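A quick usage sketch (the tile values are made up):
g = Grid()
g[(-2, -2)] = "water"
g[(-2, 5)] = "grass"
print(g[(-2, -2)])   # "water"
print(g.getRow(-2))  # [(-2, -2, "water"), (-2, 5, "grass")] (set iteration order is arbitrary)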
Note that there are many other possible data structures depending on exactly what you are trying to do, how frequent is reading, how frequent is updating, if you query by rectangles, if you look for nearest non-empty cell and so on.
To start off with
Speed up dictionary lookups, so this would not be an issue
Dictionary lookups are pretty fast, O(1), but (from your other question) you're not relying on the hash-table lookup of the dictionary, you're relying on a linear search of the dictionary's keys.
Find some kind of container that will support sparse, negative indices?
This isn't indexing into the dictionary. A tuple is an immutable object, and you are hashing the tuple as a whole. The dictionary really has no idea of the contents of the keys, just their hash.
I'm going to suggest, as others did, that you restructure your data.
For example, you could create objects that encapsulate the data you need, and arrange them in a binary tree for O(log n) searches. You can even go so far as to wrap the entire thing in a class that will give you the nice if foo in Bar: syntax you're looking for.
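A rough sketch of that idea, using a sorted list maintained with bisect as a stand-in for a binary tree (class and method names are invented; inserts are O(n) here, only the containment check is O(log n)):
import bisect

class TileIndex:
    """Keeps coordinates sorted so containment checks are O(log n)."""
    def __init__(self):
        self._coords = []  # sorted list of (x, y) tuples

    def add(self, coord):
        bisect.insort(self._coords, coord)  # O(log n) search, O(n) shift

    def __contains__(self, coord):
        i = bisect.bisect_left(self._coords, coord)
        return i < len(self._coords) and self._coords[i] == coord

# usage: enables the "if foo in bar:" syntax mentioned above
tiles = TileIndex()
tiles.add((-2, -2))
print((-2, -2) in tiles)  # True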
You probably need a couple coordinated structures to accomplish what you want. Here's a simplified example using dicts and sets (tweaking user 6502's suggestion a bit).
# this will be your dict that holds all the data
matrix = {}
# and each of these will be a dict of sets, pointing to coordinates
cols = {}
rows = {}

def add_data(coord, data):
    matrix[coord] = data
    try:
        cols[coord[0]].add(coord)
    except KeyError:
        # wrap coords in a list to prevent set() from iterating over it
        cols[coord[0]] = set([coord])
    try:
        rows[coord[1]].add(coord)
    except KeyError:
        rows[coord[1]] = set([coord])

# now you can find all coordinates from a row or column quickly
>>> add_data((2, 7), "foo4")
>>> add_data((2, 5), "foo3")
>>> 2 in cols
True
>>> 5 in rows
True
>>> [matrix[coord] for coord in cols[2]]
['foo4', 'foo3']
Now just wrap that in a class or a module, and you'll be off, and as always, if it's not fast enough profile and test before you guess.
Dictionary lookups are very fast. Searching for part of the key (e.g. all tiles in row x) is what's not fast. You could use a dict of dicts. Rather than a single dict indexed by a 2-tuple, use nested dicts like this:
somedict = {0: {}, 1:{}}
somedict[0][-5] = "thingy"
somedict[1][4] = "bing"
Then if you want all the tiles in a given "row" it's just somedict[0].
You will need some logic to add the secondary dictionaries where necessary and so on. Hint: check out the get() and setdefault() methods on the standard dict type, or possibly the collections.defaultdict type.
This approach gives you quick access to all tiles in a given row. It's still slow-ish if you want all the tiles in a given column (though at least you won't need to look through every single cell, just every row). However, if needed, you could get around that by having two dicts of dicts (one in column, row order and the other in row, column order). Updating then becomes twice as much work, which may not matter for a game where most of the tiles are static, but access is very easy in either direction.
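A minimal sketch of the dict-of-dicts structure using collections.defaultdict (the tile values here are invented):
from collections import defaultdict

somedict = defaultdict(dict)  # row index -> {column index -> tile}
somedict[0][-5] = "thingy"
somedict[1][4] = "bing"
somedict[0][7] = "bong"

row0 = somedict[0]  # all tiles in row 0: {-5: "thingy", 7: "bong"}
# note: reading a missing row on a defaultdict silently creates an empty dict for it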
If you only need to store numbers and most of your cells will be 0, check out scipy's sparse matrix classes.
One alternative would be to simply shift the index so it's positive.
E.g. if your indices are contiguous like this:
...
-2 -> a
-1 -> c
0 -> d
1 -> e
2 -> f
...
Just do something like LookupArray[Index + MinimumIndex], where MinimumIndex is the absolute value of the smallest index you would use.
That way, if your minimum was say, -50, it would map to 0. -20 would map to 30, and so forth.
Edit:
An alternative would be to use a trick with how you use the indices. Define the following key function
Key(n) = 2 * n        (n >= 0)
Key(n) = -2 * n - 1   (n < 0)
This maps all non-negative keys to the even indices, and all negative keys to the positive odd indices. This may not be practical though, since if you add 100 negative keys, you'd have to expand your array by 200.
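For instance, a sketch of that mapping in Python (the stored values are made up):
def key(n):
    # non-negative indices -> even slots, negative indices -> odd slots
    return 2 * n if n >= 0 else -2 * n - 1

values = [None] * 10
values[key(-3)] = "tile A"  # stored at slot 5
values[key(2)] = "tile B"   # stored at slot 4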
One other thing to note: if you plan on doing lookups and the number of keys is constant (or very slowly changing), stick with an array. Otherwise, dictionaries aren't bad at all.
Use multi-dimensional lists -- usually implemented as nested objects. You can easily make this handle negative indices with a little arithmetic. It might use more memory than a dictionary since something has to be put in every possible slot (usually None for empty ones), but access will be done via simple indexing lookup rather than hashing as it would with a dictionary.
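A small sketch of that idea with an offset (the grid size, offset and values are made up):
# a 100x100 grid covering coordinates -50..49 in both directions
OFFSET = 50
grid = [[None] * 100 for _ in range(100)]

def set_tile(x, y, value):
    grid[x + OFFSET][y + OFFSET] = value

def get_tile(x, y):
    return grid[x + OFFSET][y + OFFSET]

set_tile(-2, -2, "water")
print(get_tile(-2, -2))  # "water"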
