Fast String "Startswith" Matching for Dict like object

Fast String "Startswith" Matching for Dict like object - python

I currently have some code which needs to be very performant, where I am essentially doing a string dictionary key lookup:
class Foo:
def __init__(self):
self.fast_lookup = {"a": 1, "b": 2}
def bar(self, s):
return self.fast_lookup[s]
self.fast_lookup has O(1) lookup time, and there is no try/if etc code that would slow down the lookup
Is there anyway to retain this speed while doing a "startswith" lookup instead? With the code above calling bar on s="az" would result in a key error, if it were changed to a "startswith" implementation then it would return 1.
NB: I am well aware how I could do this with a regex/startswith statement, I am looking for performance specifically for startswith dict lookup

An efficient way to do this would be to use the pyahocorasick module to construct a trie with the possible keys to match, then use the longest_prefix method to determine how much of a given string matches. If no "key" matched, it returns 0, otherwise it will say how much of the string passed exists in the automata.
After installing pyahocorasick, it would look something like:
import ahocorasick
class Foo:
def __init__(self):
self.fast_lookup = ahocorasick.Automaton()
for k, v in {"a": 1, "b": 2}.items():
self.fast_lookup.add_word(k, v)
def bar(self, s):
index = self.fast_lookup.longest_prefix(s)
if not index: # No prefix match at all
raise KeyError(s)
return self.fast_lookup.get(s[:index])
If it turns out the longest prefix doesn't actually map to a value (say, 'cat' is mapped, but you're looking up 'cab', and no other entry actually maps 'ca' or 'cab'), this will die with a KeyError. Tweak as needed to achieve precise behavior desired (you might need to use longest_prefix as a starting point and try to .get() for all substrings of that length or less until you get a hit for instance).
Note that this isn't the primary purpose of Aho-Corasick (it's an efficient way to search for many fixed strings in one or more long strings in a single pass), but tries as a whole are an efficient way to deal with prefix search of this form, and Aho-Corasick is implemented in terms of tries and provides most of the useful features of tries to make it more broadly useful (as in this case).

I dont fully understand the question, but what I would do is try and think of ways to reduce the work the lookup even has to do. If you know the basic searches the startswith is going to do, you can just add those as keys to the dictionary and values that point to the same object. Your dict will get pretty big pretty fast, however it will greatly reduce the lookup i believe. So maybe for a more dynamic method you can add dict keys for the first groups of letters up to three for each entry.
Without activly storing the references for each search, your code will always need to get each dict objects value until it gets one that matches. You cannot reduce that.

Related

Remove "First" Item From Python Dict

Good afternoon.
I'm sorry if my question may seem dumb or if it has already been posted (I looked for it but didn't seem to find anything. If I'm wrong, please let me know: I'm new here and I may not be the best at searching for the correct questions).
I was wondering if it was possible to remove (pop) a generic item from a dictionary in python.
The idea came from the following exercise:
Write a function to find the sum of the VALUES in a given dictionary.
Obviously there are many ways to do it: summing dictionary.values(), creating a variable for the sum and iterate through the dict and updating it, etc.. But I was trying to solve it with recursion, with something like:
def total_sum(dictionary):
if dictionary == {}:
return 0
return dictionary.pop() + total_sum(dictionary)
The problem with this idea is that we don't know a priori which could be the "first" key of a dict since it's unordered: if it was a list, the index 0 would have been used and it all would have worked.
Since I don't care about the order in which the items are popped, it would be enough to have a way to delete any of the items (a "generic" item). Do you think something like this is possible or should I necessarily make use of some auxiliary variable, losing the whole point of the use of recursion, whose advantage would be a very concise and simple code?
I actually found the following solution, which though, as you can see, makes the code more complex and harder to read: I reckon it could still be interesting and useful if there was some built-in, simple and direct solution to that particular problem of removing the "first" item of a dict, although many "artificious", alternative solutions could be found.
def total_sum(dictionary):
if dictionary == {}:
return 0
return dictionary.pop(list(dictionary.keys())[0]) + total_sum(dictionary)
I will let you here a simple example dictionary on which the function could be applied, if you want to make some simple tests.
ex_dict = {"milk":5, "eggs":2, "flour": 3}

ex_dict.popitem()
it removes the last (most recently added) element from the dictionary

(k := next(iter(d)), d.pop(k))
will remove the leftmost (first) item (if it exists) from a dict object.
And if you want to remove the right most/recent value from the dict
d.popitem()

You can pop items from a dict, but it get's destroyed in the process. If you want to find the sum of values in a dict, it's probably easiest to just use a list comprehension.
sum([v for v in ex_dict.values()])

Instead of thinking in terms of popping values, a more pythonic approach (as far is recursion is pythonic here) is to use an iterator. You can turn the dict's values into an iterator and use that for recursion. This will be memory efficient, and give you a very clean stopping condition for your recursion:
ex_dict = {"milk":5, "eggs":2, "flour": 3}
def sum_rec(it):
if isinstance(it, dict):
it = iter(it.values())
try:
v = next(it)
except StopIteration:
return 0
return v + sum_rec(it)
sum_rec(ex_dict)
# 10
This doesn't really answer the question about popping values, but that really shouldn't be an option because you can't destroy the input dict, and making a copy just to get the sum, as you noted in the comment, could be pretty expensive.
Using popitem() would be almost the same code. You would just catch a different exception and expect the tuple from the pop. (And of course understand you emptied the dict as a side effect):
ex_dict = {"milk":5, "eggs":2, "flour": 3}
def sum_rec(d):
try:
k,v = d.popitem()
except KeyError:
return 0
return v + sum_rec(d)
sum_rec(ex_dict)
# 10

We can use:
dict.pop('keyname')

How to test if any string is in a list?

I'm trying to make an AI. The AI knows to say 'Hello' to 'hi' and to stop the program on 'bye', and if you say something it doesn't know it will ask you to define it. For example, if you say 'Hello' it will ask what that means. You type 'hi' and from then on when you say 'Hello' it will say 'Hello' back. I store everything in a list called knowledge. It works like this:
knowledge = [[term, definition], [term, definition], [term, definition]]
I am trying to add an edit function, where you type edit foo and it will ask for you to input a string, to change the definition of foo. However, I'm stuck. First, of course, I need to test if it already has a definition for foo. But I can't do that. I need to be able to do it regardless of the definition. In other languages, there is typeOf(). However type() doesn't seem to work. Here's what I have, but it doesn't work:
if [term, type(str)] in knowledge:
Can someone help?

As noted by tehhowch in the comments, a dictionary would be more appropriate as these are "key: value" pairs.
Using a dictionary...
knowledge = {'foo': 'foo def', 'bar': 'bar def', 'baz': 'baz def'}
searchTerm= 'foo'
searchTerm in knowledge
Out[1]: True
Storing knowledge as a list of lists fails because each item in knowledge is itself a list. Therefore, searching those lists for a string type (your term) fails. Instead, you could pull the terms out as a separate list and then check that one list for the term you're looking for.
knowledge = [["foo", "foo definition"], ["bar", "bar definition"], ["baz", "baz
definition"]]
terms = [item[0] for item in knowledge]
searchTerm= "foo"
searchTerm in terms
Out[1]: True

As others have mentioned, Python would typically use a dict for this kind of associative array. You approach is analogous to a Lisp data structure called an Association List. These are less efficient than the hashtable structures used by dicts, but they still have some useful properties.
For example, if you look up a key by scanning through the pairs and getting the first one, this means that you can insert another pair with the same key at the front and it will shadow the old value. You don't have to remove it. This makes insertions fast (at least with Lisp-style linked lists). You can also "undo" this operation by deleting the new one, and the old one will then be found by the scanner.
Your check if [term, type(str)] in knowledge: could be made to work as
if [term, str] in ([term, type(definition)] for term, definition in knowledge):
This uses a generator expression to convert your term, definition pairs into term, type(definition) pairs on the fly.

You can use dictionary to store definitions rather than list of lists and python's isinstance function will help you check if term belongs to specific class or not. see below example:
knowledge = {'Hello':'greeting','Hi':'greeting','Bye':'good bye'}
term = "Hello"
if isinstance(term, str):
if term in knowledge:
print("Defination exist")
else:
print("Defination doesn't exist")
else:
print("Entered term is not string")

More elegant / pythonic way to append to an array, or create it

For some reason the following code felt a bit cumbersome to me, given all the syntactic sugar I keep finding in Python, so I thought enquire if there's a better way:
pictures = list_of_random_pictures()
invalid_pictures = {}
for picture in pictures:
if picture_invalid(picture):
if not invalid_pictures.get(picture.album.id):
invalid_pictures[picture.album.id] = []
invalid_pictures[picture.album.id].append(picture)
So just to clarify, I'm wondering if there's a more readable way to take care of the last 3 lines above. As I'm repeating invalid_pictures[picture.album.id] 3 times, it just seems unnecessary if it's at all avoidable.
EDIT: Just realized my code above will KeyError, so I've altered it to handle that.

There is, in fact, an object in the standard library to deal exactly with that case: collections.defaultdict
It takes as its argument either a type or function which produces the default value for a missing key.
from collections import defaultdict
invalid_pictures = defaultdict(list)
for picture in pictures:
if picture_invalid(picture):
invalid_pictures[picture.album.id].append(picture)

Although I do find Joel's answer to be the usual solution, there's an often overlooked feature that is sometimes preferrable. when a default value isn't particularly desired is to use dict.setdefault(), when inserting the default is something of an exception (and collections.defaultdict is suboptimal. your code, adapted to use setdefault() would be
pictures = list_of_random_pictures()
invalid_pictures = {}
for picture in pictures:
if picture_invalid(picture):
invalid_pictures.setdefault(picture.album.id, []).append(picture)
although this has the downside of creating a lot of (probably unused) lists, so in the particular case of setting the default to a list, it's still often better to use a defaultdict, and then copy it to a regular dict to strip off the default value. This doesn't particularly apply to immutable types, such as int or str.
For example, pep-333 requires that the environ variable be a regular dict, not a subclass or instance of collections.Mapping, but a real, plain ol' dict. Another case is when you aren't so much as passing through an iterator and applying to a flat mapping; but each key needs special handling, such as when applying defaults to a configuration.

C data structures

Is there a C data structure equatable to the following python structure?
data = {'X': 1, 'Y': 2}
Basically I want a structure where I can give it an pre-defined string and have it come out with an integer.

The data-structure you are looking for is called a "hash table" (or "hash map"). You can find the source code for one here.
A hash table is a mutable mapping of an integer (usually derived from a string) to another value, just like the dict from Python, which your sample code instantiates.
It's called a "hash table" because it performs a hash function on the string to return an integer result, and then directly uses that integer to point to the address of your desired data.
This system makes it extremely extremely quick to access and change your information, even if you have tons of it. It also means that the data is unordered because a hash function returns a uniformly random result and puts your data unpredictable all over the map (in a perfect world).

Also note that if you're doing a quick one-off hash, like a two or three static hash for some lookup: look at gperf, which generates a perfect hash function and generates simple code for that hash.

The above data structure is a dict type.
In C/C++ paralance, a hashmap should be equivalent, Google for hashmap implementation.

There's nothing built into the language or standard library itself but, depending on your requirements, there are a number of ways to do it.
If the data set will remain relatively small, the easiest solution is to probably just have an array of structures along the lines of:
typedef struct {
char *key;
int val;
} tElement;
then use a sequential search to look them up. Have functions which insert keys, delete keys and look up keys so that, if you need to change it in future, the API itself won't change. Pseudo-code:
def init:
create g.key[100] as string
create g.val[100] as integer
set g.size to 0
def add (key,val):
if lookup(key) != not_found:
return already_exists
if g.size == 100:
return no_space
g.key[g.size] = key
g.val[g.size] = val
g.size = g.size + 1
return okay
def del (key):
pos = lookup (key)
if pos == not_found:
return no_such_key
if pos < g.size - 1:
g.key[pos] = g.key[g.size-1]
g.val[pos] = g.val[g.size-1]
g.size = g.size - 1
def find (key):
for pos goes from 0 to g.size-1:
if g.key[pos] == key:
return pos
return not_found
Insertion means ensuring it doesn't already exist then just tacking an element on to the end (you'll maintain a separate size variable for the structure). Deletion means finding the element then simply overwriting it with the last used element and decrementing the size variable.
Now this isn't the most efficient method in the world but you need to keep in mind that it usually only makes a difference as your dataset gets much larger. The difference between a binary tree or hash and a sequential search is irrelevant for, say, 20 entries. I've even used bubble sort for small data sets where a more efficient one wasn't available. That's because it massively quick to code up and the performance is irrelevant.
Stepping up from there, you can remove the fixed upper size by using a linked list. The search is still relatively inefficient since you're doing it sequentially but the same caveats apply as for the array solution above. The cost of removing the upper bound is a slight penalty for insertion and deletion.
If you want a little more performance and a non-fixed upper limit, you can use a binary tree to store the elements. This gets rid of the sequential search when looking for keys and is suited to somewhat larger data sets.
If you don't know how big your data set will be getting, I would consider this the absolute minimum.
A hash is probably the next step up from there. This performs a function on the string to get a bucket number (usually treated as an array index of some sort). This is O(1) lookup but the aim is to have a hash function that only allocates one item per bucket, so that no further processing is required to get the value.
A degenerate case of "all items in the same bucket" is no different to an array or linked list.
For maximum performance, and assuming the keys are fixed and known in advance, you can actually create your own hashing function based on the keys themselves.
Knowing the keys up front, you have extra information that allows you to fully optimise a hashing function to generate the actual value so you don't even involve buckets - the value generated by the hashing function can be the desired value itself rather than a bucket to get the value from.
I had to put one of these together recently for converting textual months ("January", etc) in to month numbers. You can see the process here.
I mention this possibility because of your "pre-defined string" comment. If your keys are limited to "X" and "Y" (as in your example) and you're using a character set with contiguous {W,X,Y} characters (which even covers EBCDIC as well as ASCII though not necessarily every esoteric character set allowed by ISO), the simplest hashing function would be:
char *s = "X";
int val = *s - 'W';
Note that this doesn't work well if you feed it bad data. These are ideal for when the data is known to be restricted to certain values. The cost of checking data can often swamp the saving given by a pre-optimised hash function like this.

C doesn't have any collection classes. C++ has std::map.
You might try searching for C implementations of maps, e.g. http://elliottback.com/wp/hashmap-implementation-in-c/

A 'trie' or a 'hasmap' should do. The simplest implementation is an array of struct { char *s; int i }; pairs.
Check out 'trie' in 'include/nscript.h' and 'src/trie.c' here: http://github.com/nikki93/nscript . Change the 'trie_info' type to 'int'.

Try a Trie for strings, or a Tree of some sort for integer/pointer types (or anything that can be compared as "less than" or "greater than" another key). Wikipedia has reasonably good articles on both, and they can be implemented in C.

Using 'in' to match an attribute of Python objects in an array

I don't remember whether I was dreaming or not but I seem to recall there being a function which allowed something like,
foo in iter_attr(array of python objects, attribute name)
I've looked over the docs but this kind of thing doesn't fall under any obvious listed headers

Using a list comprehension would build a temporary list, which could eat all your memory if the sequence being searched is large. Even if the sequence is not large, building the list means iterating over the whole of the sequence before in could start its search.
The temporary list can be avoiding by using a generator expression:
foo = 12
foo in (obj.id for obj in bar)
Now, as long as obj.id == 12 near the start of bar, the search will be fast, even if bar is infinitely long.
As #Matt suggested, it's a good idea to use hasattr if any of the objects in bar can be missing an id attribute:
foo = 12
foo in (obj.id for obj in bar if hasattr(obj, 'id'))

Are you looking to get a list of objects that have a certain attribute? If so, a list comprehension is the right way to do this.
result = [obj for obj in listOfObjs if hasattr(obj, 'attributeName')]

you could always write one yourself:
def iterattr(iterator, attributename):
for obj in iterator:
yield getattr(obj, attributename)
will work with anything that iterates, be it a tuple, list, or whatever.
I love python, it makes stuff like this very simple and no more of a hassle than neccessary, and in use stuff like this is hugely elegant.

No, you were not dreaming. Python has a pretty excellent list comprehension system that lets you manipulate lists pretty elegantly, and depending on exactly what you want to accomplish, this can be done a couple of ways. In essence, what you're doing is saying "For item in list if criteria.matches", and from that you can just iterate through the results or dump the results into a new list.
I'm going to crib an example from Dive Into Python here, because it's pretty elegant and they're smarter than I am. Here they're getting a list of files in a directory, then filtering the list for all files that match a regular expression criteria.
files = os.listdir(path)
test = re.compile("test\.py$", re.IGNORECASE)
files = [f for f in files if test.search(f)]
You could do this without regular expressions, for your example, for anything where your expression at the end returns true for a match. There are other options like using the filter() function, but if I were going to choose, I'd go with this.
Eric Sipple

The function you are thinking of is probably operator.attrgettter. For example, to get a list that contains the value of each object's "id" attribute:
import operator
ids = map(operator.attrgetter("id"), bar)
If you want to check whether the list contains an object with an id == 12, then a neat and efficient (i.e. doesn't iterate the whole list unnecessarily) way to do it is:
any(obj.id == 12 for obj in bar)
If you want to use 'in' with attrgetter, while still retaining lazy iteration of the list:
import operator,itertools
foo = 12
foo in itertools.imap(operator.attrgetter("id"), bar)

What I was thinking of can be achieved using list comprehensions, but I thought that there was a function that did this in a slightly neater way.
i.e. 'bar' is a list of objects, all of which have the attribute 'id'
The mythical functional way:
foo = 12
foo in iter_attr(bar, 'id')
The list comprehension way:
foo = 12
foo in [obj.id for obj in bar]
In retrospect the list comprehension way is pretty neat anyway.

If you plan on searching anything of remotely decent size, your best bet is going to be to use a dictionary or a set. Otherwise, you basically have to iterate through every element of the iterator until you get to the one you want.
If this isn't necessarily performance sensitive code, then the list comprehension way should work. But note that it is fairly inefficient because it goes over every element of the iterator and then goes BACK over it again until it finds what it wants.
Remember, python has one of the most efficient hashing algorithms around. Use it to your advantage.

I think:
#!/bin/python
bar in dict(Foo)
Is what you are thinking of. When trying to see if a certain key exists within a dictionary in python (python's version of a hash table) there are two ways to check. First is the has_key() method attached to the dictionary and second is the example given above. It will return a boolean value.
That should answer your question.
And now a little off topic to tie this in to the list comprehension answer previously given (for a bit more clarity). List Comprehensions construct a list from a basic for loop with modifiers. As an example (to clarify slightly), a way to use the in dict language construct in a list comprehension:
Say you have a two dimensional dictionary foo and you only want the second dimension dictionaries which contain the key bar. A relatively straightforward way to do so would be to use a list comprehension with a conditional as follows:
#!/bin/python
baz = dict([(key, value) for key, value in foo if bar in value])
Note the if bar in value at the end of the statement**, this is a modifying clause which tells the list comprehension to only keep those key-value pairs which meet the conditional.** In this case baz is a new dictionary which contains only the dictionaries from foo which contain bar (Hopefully I didn't miss anything in that code example... you may have to take a look at the list comprehension documentation found in docs.python.org tutorials and at secnetix.de, both sites are good references if you have questions in the future.).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Fast String "Startswith" Matching for Dict like object - python

Related

Remove "First" Item From Python Dict

How to test if any string is in a list?

More elegant / pythonic way to append to an array, or create it

C data structures

Using 'in' to match an attribute of Python objects in an array

Categories

Resources