Optimised Python dictionary / negative index storage

Optimised Python dictionary / negative index storage - python

Raised by this question's comments (I can see that this is irrelevant), I am now aware that using dictionaries for data that needs to be queried/accessed regularly is not good, speedwise.
I have a situation of something like this:
someDict = {}
someDict[(-2, -2)] = something
somedict[(3, -10)] = something else
I am storing keys of coordinates to objects that act as arrays of tiles in a game. These are going to be negative at some point, so I can't use a list or some kind of sparse array (I think that's the term?).
Can I either:
Speed up dictionary lookups, so this would not be an issue
Find some kind of container that will support sparse, negative indices?
I would use a list, but then the querying would go from O(log n) to O(n) to find the area at (x, y). (I think my timings are off here too).

Python dictionaries are very very fast, and using a tuple of integers is not going to be a problem. However your use case seems that sometimes you need to do a single-coordinate check and doing that traversing all the dict is of course slow.
Instead of doing a linear search you can however speed up the data structure for the access you need using three dictionaries:
class Grid(object):
def __init__(self):
self.data = {} # (i, j) -> data
self.cols = {} # i -> set of j
self.rows = {} # j -> set of i
def __getitem__(self, ij):
return self.data[ij]
def __setitem__(self, ij, value):
i, j = ij
self.data[ij] = value
try:
self.cols[i].add(j)
except KeyError:
self.cols[i] = set([j])
try:
self.rows[j].add(i)
except KeyError:
self.rows[j] = add([i])
def getRow(self, i):
return [(i, j, data[(i, j)])
for j in self.cols.get(i, [])]
def getCol(self, j):
return [(i, j, data[(i, j)])
for i in self.rows.get(j, [])]
Note that there are many other possible data structures depending on exactly what you are trying to do, how frequent is reading, how frequent is updating, if you query by rectangles, if you look for nearest non-empty cell and so on.

To start off with
Speed up dictionary lookups, so this would not be an issue
Dictionary lookups are pretty fast O(1), but (from your other question) you're not relying on the hash-table lookup of the dictionary, your relying on a linear search of the dictionary's keys.
Find some kind of container that will support sparse, negative indices?
This isn't indexing into the dictionary. A tuple is an immutable object, and you are hashing the tuple as a whole. The dictionary really has no idea of the contents of the keys, just their hash.
I'm going to suggest, as others did, that you restructure your data.
For example, you could create objects that encapsulate the data you need, and arrange them in a binary tree for O(n lg n) searches. You can even go so far as to wrap the entire thing in a class that will give you the nice if foo in Bar: syntax your looking for.
You probably need a couple coordinated structures to accomplish what you want. Here's a simplified example using dicts and sets (tweaking user 6502's suggestion a bit).
# this will be your dict that holds all the data
matrix = {}
# and each of these will be a dict of sets, pointing to coordinates
cols = {}
rows = {}
def add_data(coord, data)
matrix[coord] = data
try:
cols[coord[0]].add(coord)
except KeyError:
# wrap coords in a list to prevent set() from iterating over it
cols[coord[0]] = set([coord])
try:
rows[coord[1]].add(coord)
except KeyError:
rows[coord[1]] = set([coord])
# now you can find all coordinates from a row or column quickly
>>> add_data((2, 7), "foo4")
>>> add_data((2, 5), "foo3")
>>> 2 in cols
True
>>> 5 in rows
True
>>> [matrix[coord] for coord in cols[2]]
['foo4', 'foo3']
Now just wrap that in a class or a module, and you'll be off, and as always, if it's not fast enough profile and test before you guess.

Dictionary lookups are very fast. Searching for part of the key (e.g. all tiles in row x) is what's not fast. You could use a dict of dicts. Rather than a single dict indexed by a 2-tuple, use nested dicts like this:
somedict = {0: {}, 1:{}}
somedict[0][-5] = "thingy"
somedict[1][4] = "bing"
Then if you want all the tiles in a given "row" it's just somedict[0].
You will need some logic to add the secondary dictionaries where necessary and so on. Hint: check out getitem() and setdefault() on the standard dict type, or possibly the collections.defaultdict type.
This approach gives you quick access to all tiles in a given row. It's still slow-ish if you want all the tiles in a given column (though at least you won't need to look through every single cell, just every row). However, if needed, you could get around that by having two dicts of dicts (one in column, row order and the other in row, column order). Updating then becomes twice as much work, which may not matter for a game where most of the tiles are static, but access is very easy in either direction.
If you only need to store numbers and most of your cells will be 0, check out scipy's sparse matrix classes.

One alternative would be to simply shift the index so it's positive.
E.g. if your indices are contiguous like this:
...
-2 -> a
-1 -> c
0 -> d
1 -> e
2 -> f
...
Just do something like LookupArray[Index + MinimumIndex], where MinimumIndex is the absolute value of the smallest index you would use.
That way, if your minimum was say, -50, it would map to 0. -20 would map to 30, and so forth.
Edit:
An alternative would be to use a trick with how you use the indices. Define the following key function
Key(n) = 2 * n (n >= 0)
Key(n) = -2 * n - 1. (n < 0)
This maps all positive keys to the positive even indices, and all negative elements to the positive odd indices. This may not be practical though, since if you add 100 negative keys, you'd have to expand your array by 200.
One other thing to note: If you plan on doing look ups and the number of keys is constant (or very slowly changing), stick with an array. Otherwise, dictionaries aren't bad at all.

Use multi-dimensional lists -- usually implemented as nested objects. You can easily make this handle negative indices with a little arithmetic. It might use a more memory than a dictionary since something has to be put in every possible slot (usually None for empty ones), but access will be done via simple indexing lookup rather than hashing as it would with a dictionary.

Related

Is there a better way than a while loop to perform this function?

I was attempting some python exercises and I hit the 5s timeout on one of the tests. The function is pre-populated with the parameters and I am tasked with writing code that is fast enough to run within the max timeframe of 5s.
There are N dishes in a row on a kaiten belt, with the ith dish being of type Di. Some dishes may be of the same type as one another. The N dishes will arrive in front of you, one after another in order, and for each one you'll eat it as long as it isn't the same type as any of the previous K dishes you've eaten. You eat very fast, so you can consume a dish before the next one gets to you. Any dishes you choose not to eat as they pass will be eaten by others.
Determine how many dishes you'll end up eating.
Issue
The code "works" but is not fast enough.
Code
The idea here is to add the D[i] entry if it is not in the pastDishes list (which can be of size K).
from typing import List
# Write any import statements here
def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
# Write your code here
numDishes=0
pastDishes=[]
i=0
while (i<N):
if(D[i] not in pastDishes):
numDishes+=1
pastDishes.append(D[i])
if len(pastDishes)>K:
pastDishes.pop(0)
i+=1
return numDishes
Is there a more effective way?

After much trial and error, I have finally found a solution that is fast enough to pass the final case in the puzzle you are working on. My previous code was very neat and quick, however, I have finally found a module with a tool that makes this much faster. Its from collections just as deque is, however it is called Counter.
This was my original code:
def getMaximumEatenDishCount(N: int, D: list, K: int) -> int:
numDishes=lastMod=0
pastDishes=[0]*K
for Dval in D:
if Dval in pastDishes:continue
pastDishes[lastMod] = Dval
numDishes,lastMod = numDishes+1,(lastMod+1)%K
return numDishes
I then implemented Counter like so:
from typing import List
# Write any import statements here
from collections import Counter
def getMaximumEatenDishCount(N: int, D: 'list[int]', K: int) -> int:
eatCount=lastMod = 0
pastDishes=[0]*K
eatenCounts = Counter({0:K})
for Dval in D:
if Dval in eatenCounts:continue
eatCount +=1
eatenCounts[Dval] +=1
val = pastDishes[lastMod]
if eatenCounts[val] <= 1: eatenCounts.pop(val)
else: eatenCounts[val] -= 1
pastDishes[lastMod]=Dval
lastMod = (lastMod+1)%K
return eatCount
Which ended up working quite well. I'm sure you can make it less clunky, but this should work fine on its own.
Some Explanation of what i am doing:
Typically while loops are actually marginally faster than a for loop, however since I need to access the value at an index multiple times if i used it, using a for loop I believe is actually better in this situation. You can see i also initialised the list to the max size it needs to be and am writing over the values instead of popping+appending which saves alot of time. Additionally, as pointed out by #outis, another small improvement was made in my code by using the modulo operator in conjunction with the variable which removes the need for an additional if statement. The Counter is essentially a special dict object that holds a hashable as the key and an int as the value. I use the fact that lastMod is an index to what would normally be accesed through list.pop(0) to access the object needed to either remove or decrement in the counter
Note that it is not considered 'pythonic' to assign multiple variable on one line, however I believe it adds a slight performance boost which is why I have done it. This can be argued though, see this post.
If anyone else is interested the problem that we were trying to solve, it can be found here: https://www.facebookrecruiting.com/portal/coding_puzzles/?puzzle=958513514962507

Can we use an appropriate data structure? If so:
Data structures
Seems like an ordered set which you have to shrink to a capacity restriction of K.
To meet that, if exceeds (len(ordered_set) > K) we have to remove the first n items where n = len(ordered_set) - K. Ideally the removal will perform in O(1).
However since removal on a set is in unordered fashion. We first transform it to a list. A list containing unique elements in the order of appearance in their original sequence.
From that ordered list we can then remove the first n elements.
For example: the function lru returns the least-recently-used items for a sequence seq limited by capacity-limit k.
To obtain the length we can simply call len() on that LRU return value:
maximumEatenDishCount = len(lru(seq, k))
See also:
Does Python have an ordered set?
Fastest way to get sorted unique list in python?
Using set for uniqueness (up to Python 3.6)
def lru(seq, k):
return list(set(seq))[:k]
Using dict for uniqueness (since Python 3.6)
Same mechanics as above, but using the preserved insertion order of dicts since 3.7:
using OrderedDict explicitly
from collections import OrderedDict
def lru(seq, k):
return list(OrderedDict.fromkeys(seq).keys())[:k]
using dict factory-method:
def lru(seq, k):
return list(dict.fromkeys(seq).keys())[:k]
using dict-comprehension:
def lru(seq, k):
return list({i:0 for i in seq}.keys())[:k]
See also:
The order of keys in dictionaries
Using ordered dictionary as ordered set
How do you remove duplicates from a list whilst preserving order?
Real Python: OrderedDict vs dict in Python: The Right Tool for the Job

As the problem is an exercise, exact solutions are not included. Instead, strategies are described.
There are at least a couple potential approaches:
Use a data structure that supports fast containment testing (a set in use, if not in name) limited to the K most recently eaten dishes. Fortunately, since dict preserves insertion order in newer Python versions and testing key containment is fast, it will fit the bill. dict requires that keys be hashable, but since the problem uses ints to represent dish types, that requirement is met.
With this approach, the algorithm in the question remains unchanged.
Rather than checking whether the next dish type is any of the last K dishes, check whether the last time the next dish was eaten is within K of the current plate count. If it is, skip the dish. If not, eat the dish (update both the record of when the next dish was last eaten and the current dish count). In terms of data structures, the program will need to keep a record of when any given dish type was last eaten (initialized to -K-1 to ensure that the first time a dish type is encountered it will be eaten; defaultdict can be very useful for this).
With this approach, the algorithm is slightly different. The code ends up being slightly shorter, as there's no shortening of the data structure storing information about the dishes as there is in the original algorithm.
There are two takeaways from the latter approach that might be applied when solving other problems:
More broadly, reframing a problem (such as from "the dish is in the last K dishes eaten" to "the dish was last eaten within K dishes of now") can result in a simpler approach.
Less broadly, sometimes it's more efficient to work with a flipped data structure, swapping keys/indices and values.
Approach & takeaway 2 both remind me of a substring search algorithm (the name escapes me) that uses a table of positions in the needle (the string to search for) of where each character first appears (for characters not in the string, the table has the length of the string); when a mismatch occurs, the algorithm uses the table to align the substring with the mismatching character, then starts checking at the start of the substring. It's not the most efficient string search algorithm, but it's simple and more efficient than the naive algorithm. It's similar to but simpler and less efficient than the skip search algorithm, which uses the positions of every occurrence of each character in the needle.

from typing import List
# Write any import statements here
from collections import deque, Counter
def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
# Write your code here
q = deque()
cnt = 0
dish_counter = Counter()
for d in D:
if dish_counter[d] == 0:
cnt += 1
q.append(d)
dish_counter[d] += 1
if len(q) == K + 1:
remove = q.popleft()
dish_counter[remove] -= 1
return cnt

How to implement dicts / sets opposed to a list search, to increase speed

I am making a program that has to search through very long lists, and I have seen people suggesting that using sets and dicts speeds it up massively. However, I am at a loss as to how to make it work within my code. Currently, the program does this:
indexes = []
print("Collecting indexes...")
for term in sliced_5:
indexes.append(hex_crypted.index(term))
The code searches through the hex_crypted list, which contains 1,000,000+ terms, finds the index of the term, and then appends it to the the 'indexes' list.
I simply need to speed this process. Thanks for any help.

You want to build a lookup table so you don't need to repeatedly loop over hex_crypted. Then you can simply look up each term in the table.
print("Collecting indexes...")
lookup = {term: index for (index, term) in enumerate(hex_crypted)}
indexes = [lookup[term] for term in sliced_5]

The fastest method if you have a list is to do a set function on the list to return it as a set, but I don't think that is what you want to do in this case.
hex_crypted_set = set(hex_crypted)
If you need to keep that index for some reason, you'll want to instead build a dictionary first.
hex_crypted_dict = {}
for i in enumerate(hex_crypted):
hex_crypted_dict[i[1]] = i[0]
And then to get that index you just search the dict:
indexes = []
for term in sliced_5:
indexes.append(hex_crypted_dict[term])
You will end up with the appropriate indexes which correspond to the original long list and only iterate that long list one time, which will be a lot better performance than iterating it for every time you do a lookup.

The first step is to generate a dict, for example:
hex_crypted_dict = {v: i for i, v in enumerate(hex_crypted)}
Then your code changed to
indexes = []
hex_crypted_dict = {v: i for i, v in enumerate(hex_crypted)}
print("Collecting indexes...")
for term in sliced_5:
indexes.append(hex_crypted_dict[term])

What's the most efficient way to perform a multiple match lookup in a python dictionary?

I'm looking to maximally optimize the runtime for this chunk of code:
aDictionary= {"key":["value", "value2", ...
rests = \
list(map((lambda key: Resp(key=key)),
[key for key, values in
aDictionary.items() if (test1 in values or test2 in values)]))
using python3. willing to throw as much memory at it as possible.
considering throwing the two dictionary lookups on separate processes for speedup (does that make sense?). any other optimization ideas welcome
values can definitely be sorted and turned into a set; it is precomputed, very very large.
always len(values) >>>> len(tests), though they're both growing over time
len(tests) grows very very slowly, and has new values for each execution
currently looking at strings (considering doing a string->integer mapping)

For starters, there is no reason to use map when you are already using a list comprehension, so you can remove that, as well as the outer list call:
rests = [Resp(key=key) for key, values in aDictionary.items()
if (test1 in values or test2 in values)]
A second possible optimization might be to turn each list of values into a set. That would take up time initially, but it would change your lookups (in uses) from linear time to constant time. You might need to create a separate helper function for that. Something like:
def anyIn(checking, checkingAgainst):
checkingAgainst = set(checkingAgainst)
for val in checking:
if val in checkingAgainst:
return True
return False
Then you could change the end of your list comprehension to read
...if anyIn([test1, test2], values)]
But again, this would probably only be worth it if you had more than two values you were checking, or if the list of values in values is very long.

If tests are sufficiently numerous, it will surely pay off to switch to set operations:
tests = set([test1, test2, ...])
resps = map(Resp, (k for k, values in dic.items() if not tests.isdisjoint(values)))
# resps this is a lazy iterable, not a list, and it uses a
# generator inside, thus saving the overhead of building
# the inner list.
Turning the dict values into sets would not gain anything as the conversion would be O(N) with N being the added size of all values-lists, while the above disjoint operation will only iterate each values until it encounters a testx with O(1) lookup.
map is possibly more performant compared to a comprehension if you do not have to use lambda, e.g. if key can be used as the first positional argument in Resp's __init__, but certainly not with the lambda! (Python List Comprehension Vs. Map). Otherwise, a generator or comprehension will be better:
resps = (Resp(key=k) for k, values in dic.items() if not tests.isdisjoint(values))
#resps = [Resp(key=k) for k, values in dic.items() if not tests.isdisjoint(values)]

Fast membership in slices of lists

I have a very large list called 'data' and I need to answer queries equivalent to
if (x in data[a:b]):
for different values of a, b and x.
Is it possible to preprocess data to make these queries fast

idea
you may create a dict. For every element store the sorted list of positions where it occurs.
To answer query: binary search first element that greater or equal a, check if it exists and less than b
Pseudocode
Preprocessing:
from collections import defaultdict
byvalue = defaultdict(list)
for i, x in enumerate(data):
byvalue[x].append(i)
Query:
def has_index_in_slice(indices, a, b):
r = bisect.bisect_left(indices, a)
return r < len(indices) and indices[r] < b
def check(byvalue, x, a, b):
indices = byvalue.get(x, None)
if not indices: return False
return has_index_in_slice(indices, a, b)
Complexity is O(log N) per query here if we suppose that list and dict have O(1) "get by index" complexity.

Yes, you could preprocess those slices into sets, thereby making membership lookup O(1) instead of O(n):
check = set(data[a:b])
if x in check:
# do something
if y in check:
# do something else

Put the list in a database and take advantage of the built in indexing, optimizations and caching. For instance, from the PostgreSQL manual:
Once an index is created, no further intervention is required: the
system will update the index when the table is modified, and it will
use the index in queries when it thinks doing so would be more
efficient than a sequential table scan.
But you could also use sqlite for simplicity (and availability in Python's standard library). From Python's documentation, regarding indexing:
A Row instance serves as a highly optimized row_factory for Connection
objects. It tries to mimic a tuple in most of its features.
It supports mapping access by column name and index, iteration,
representation, equality testing and len().
And elsewhere on that page:
Row provides both index-based and case-insensitive name-based access
to columns with almost no memory overhead. It will probably be better
than your own custom dictionary-based approach or even a db_row based
solution.

Python: remove lots of items from a list

I am in the final stretch of a project I have been working on. Everything is running smoothly but I have a bottleneck that I am having trouble working around.
I have a list of tuples. The list ranges in length from say 40,000 - 1,000,000 records. Now I have a dictionary where each and every (value, key) is a tuple in the list.
So, I might have
myList = [(20000, 11), (16000, 4), (14000, 9)...]
myDict = {11:20000, 9:14000, ...}
I want to remove each (v, k) tuple from the list.
Currently I am doing:
for k, v in myDict.iteritems():
myList.remove((v, k))
Removing 838 tuples from the list containing 20,000 tuples takes anywhere from 3 - 4 seconds. I will most likely be removing more like 10,000 tuples from a list of 1,000,000 so I need this to be faster.
Is there a better way to do this?
I can provide code used to test, plus pickled data from the actual application if needed.

You'll have to measure, but I can imagine this to be more performant:
myList = filter(lambda x: myDict.get(x[1], None) != x[0], myList)
because the lookup happens in the dict, which is more suited for this kind of thing. Note, though, that this will create a new list before removing the old one; so there's a memory tradeoff. If that's an issue, rethinking your container type as jkp suggest might be in order.
Edit: Be careful, though, if None is actually in your list -- you'd have to use a different "placeholder."

To remove about 10,000 tuples from a list of about 1,000,000, if the values are hashable, the fastest approach should be:
totoss = set((v,k) for (k,v) in myDict.iteritems())
myList[:] = [x for x in myList if x not in totoss]
The preparation of the set is a small one-time cost, wich saves doing tuple unpacking and repacking, or tuple indexing, a lot of times. Assignign to myList[:] instead of assigning to myList is also semantically important (in case there are any other references to myList around, it's not enough to rebind just the name -- you really want to rebind the contents!-).
I don't have your test-data around to do the time measurement myself, alas!, but, let me know how it plays our on your test data!
If the values are not hashable (e.g. they're sub-lists, for example), fastest is probably:
sentinel = object()
myList[:] = [x for x in myList if myDict.get(x[0], sentinel) != x[1]]
or maybe (shouldn't make a big difference either way, but I suspect the previous one is better -- indexing is cheaper than unpacking and repacking):
sentinel = object()
myList[:] = [(a,b) for (a,b) in myList if myDict.get(a, sentinel) != b]
In these two variants the sentinel idiom is used to ward against values of None (which is not a problem for the preferred set-based approach -- if values are hashable!) as it's going to be way cheaper than if a not in myDict or myDict[a] != b (which requires two indexings into myDict).

Every time you call myList.remove, Python has to scan over the entire list to search for that item and remove it. In the worst case scenario, every item you look for would be at the end of the list each time.
Have you tried doing the "inverse" operation of:
newMyList = [(v,k) for (v,k) in myList if not k in myDict]
But I'm really not sure how well that would scale, either, since you would be making a copy of the original list -- could potentially be a lot of memory usage there.
Probably the best alternative here is to wait for Alex Martelli to post some mind-blowingly intuitive, simple, and efficient approach.

[(i, j) for i, j in myList if myDict.get(j) != i]

Try something like this:
myListSet = set(myList)
myDictSet = set(zip(myDict.values(), myDict.keys()))
myList = list(myListSet - myDictSet)
This will convert myList to a set, will swap the keys/values in myDict and put them into a set, and will then find the difference, turn it back into a list, and assign it back to myList. :)

The problem looks to me to be the fact you are using a list as the container you are trying to remove from, and it is a totally unordered type. So to find each item in the list is a linear operation (O(n)), it has to iterate over the whole list until it finds a match.
If you could swap the list for some other container (set?) which uses a hash() of each item to order them, then each match could be performed much quicker.
The following code shows how you could do this using a combination of ideas offered by myself and Nick on this thread:
list_set = set(original_list)
dict_set = set(zip(original_dict.values(), original_dict.keys()))
difference_set = list(list_set - dict_set)
final_list = []
for item in original_list:
if item in difference_set:
final_list.append(item)

[i for i in myList if i not in list(zip(myDict.values(), myDict.keys()))]

A list containing a million 2-tuples is not large on most machines running Python. However if you absolutely must do the removal in situ, here is a clean way of doing it properly:
def filter_by_dict(my_list, my_dict):
sentinel = object()
for i in xrange(len(my_list) - 1, -1, -1):
key = my_list[i][1]
if my_dict.get(key, sentinel) is not sentinel:
del my_list[i]
Update Actually each del costs O(n) shuffling the list pointers down using C's memmove(), so if there are d dels, it's O(n*d) not O(n**2). Note that (1) the OP suggests that d approx == 0.01 * n and (2) the O(n*d) effort is copying one pointer to somewhere else in memory ... so this method could in fact be somewhat faster than a quick glance would indicate. Benchmarks, anyone?
What are you going to do with the list after you have removed the items that are in the dict? Is it possible to piggy-back the dict-filtering onto the next step?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.