Python Sparse Dictionary / Repeated Values

I have a very, very large dictionary of dictionaries. Often the values are the same, and it seems there should be a way to reduce the size by having a reference to the dictionary value that is the same.
Currently I do this with a two-pass method of "Does value have synonym" followed by look up value of synonym.
But ideally it would be great to have a way to do this in a single go.
animals = {
    'cat': {'legs': 4, 'eyes': 2},
    'dog': {'legs': 4, 'eyes': 2},
    'spider': {'legs': 8, 'eyes': 6},
}
I could have a value "mammal" that is used such that I said 'cat':mammal, but what I'd like to be able to do is 'dog':animals['cat']
Because as a reference it should take up less memory which is the goal.
I am contemplating a Class to handle this, but I can't be the first person to think that repeated values in a dictionary could be "squished" somehow, and would prefer to do it in the most pythonic way.

I think objects and inheritance are the better way to do what you want, except maybe for the memory concern.
To store a reference instead of copying the values of each dictionary, you can use the ctypes module:
import ctypes
animals = {'cat':{'legs':4,'eyes':2},'spider':{'legs':8,'eyes':6}}
# Store the id (in CPython, the memory address) of animals['cat'] under 'dog'
animals['dog'] = id(animals['cat'])
animals
{'dog': 47589527749808, 'spider': {'eyes': 6, 'legs': 8}, 'cat': {'eyes': 2, 'legs': 4}}
# Dereference the stored id to get the dictionary back
ctypes.cast(animals['dog'], ctypes.py_object).value
{'eyes': 2, 'legs': 4}
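Not sure if it is the "most pythonic way", by the way. In my opinion, classes are the right way to do this. Keep in mind that id() returning a memory address is a CPython implementation detail, and the cast is only safe while the original dictionary is still alive.
Another way could be to use the weakref module. I don't know a lot about it; see this post and its answers for other hints about using references.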
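For comparison, here is a minimal sketch without ctypes. In Python, assigning an existing dictionary to another key stores a reference to the same object rather than a copy, so equal values can be shared directly. The helper name intern_values is hypothetical, and the sketch assumes the inner values are hashable.

```python
def intern_values(table):
    """Make keys with equal inner dicts share one dict object (hypothetical helper)."""
    canonical = {}
    for key, value in list(table.items()):
        signature = frozenset(value.items())  # assumes hashable inner values
        # Reuse the first dict seen with this content, otherwise register this one.
        table[key] = canonical.setdefault(signature, value)
    return table


animals = {
    'cat': {'legs': 4, 'eyes': 2},
    'dog': {'legs': 4, 'eyes': 2},
    'spider': {'legs': 8, 'eyes': 6},
}
intern_values(animals)
shared = animals['dog'] is animals['cat']  # both keys now point at one dict
```

The trade-off is that a shared dictionary is shared for writes as well: changing animals['dog']['legs'] would also change the value seen through animals['cat'].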

Related

Fast String "Startswith" Matching for Dict like object

I currently have some code which needs to be very performant, where I am essentially doing a string dictionary key lookup:
class Foo:
    def __init__(self):
        self.fast_lookup = {"a": 1, "b": 2}
    def bar(self, s):
        return self.fast_lookup[s]
self.fast_lookup has O(1) lookup time, and there is no try/if code that would slow down the lookup.
Is there any way to retain this speed while doing a "startswith" lookup instead? With the code above, calling bar with s="az" results in a KeyError; if it were changed to a "startswith" implementation, it would return 1.
NB: I am well aware of how I could do this with a regex/startswith statement; I am looking specifically for the performance of a startswith dict lookup.

An efficient way to do this would be to use the pyahocorasick module to construct a trie with the possible keys to match, then use the longest_prefix method to determine how much of a given string matches. If no "key" matched, it returns 0; otherwise it tells you how much of the string passed exists in the automaton.
After installing pyahocorasick, it would look something like:
import ahocorasick
class Foo:
    def __init__(self):
        self.fast_lookup = ahocorasick.Automaton()
        for k, v in {"a": 1, "b": 2}.items():
            self.fast_lookup.add_word(k, v)
    def bar(self, s):
        index = self.fast_lookup.longest_prefix(s)
        if not index:  # No prefix match at all
            raise KeyError(s)
        return self.fast_lookup.get(s[:index])
If it turns out the longest prefix doesn't actually map to a value (say, 'cat' is mapped, but you're looking up 'cab', and no other entry actually maps 'ca' or 'cab'), this will die with a KeyError. Tweak as needed to achieve precise behavior desired (you might need to use longest_prefix as a starting point and try to .get() for all substrings of that length or less until you get a hit for instance).
Note that this isn't the primary purpose of Aho-Corasick (it's an efficient way to search for many fixed strings in one or more long strings in a single pass), but tries as a whole are an efficient way to deal with prefix search of this form, and Aho-Corasick is implemented in terms of tries and provides most of the useful features of tries to make it more broadly useful (as in this case).
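As a dependency-free point of comparison, the "try shorter and shorter substrings until you get a hit" idea can be sketched with a plain dict. The function name prefix_lookup is hypothetical; it does one dict lookup per candidate prefix, so it is not O(1), but it needs no third-party module.

```python
def prefix_lookup(table, s):
    """Return the value of the longest key in `table` that is a prefix of `s`."""
    for end in range(len(s), 0, -1):
        try:
            return table[s[:end]]
        except KeyError:
            continue
    raise KeyError(s)


fast_lookup = {"a": 1, "b": 2}
result = prefix_lookup(fast_lookup, "az")  # "az" misses, "a" hits
```

If the longest key is much shorter than the typical lookup string, starting the loop at the longest key length instead of len(s) would avoid pointless lookups.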

I don't fully understand the question, but what I would do is try to think of ways to reduce the work the lookup has to do. If you know the basic searches the startswith is going to do, you can just add those as keys to the dictionary, with values that point to the same object. Your dict will get pretty big pretty fast, but I believe it will greatly reduce the lookup time. So, for a more dynamic method, you could add dict keys for the leading groups of letters (up to three) of each entry.
Without actively storing the references for each search, your code will always need to check each dict key until it finds one that matches. You cannot reduce that.

Pandas dataframe from dict, why?

I can create a pandas dataframe from dict as follows:
d = {'Key':['abc','def','xyz'], 'Value':[1,2,3]}
df = pd.DataFrame(d)
df.set_index('Key', inplace=True)
And also by first creating a series like this:
d = {'abc': 1, 'def': 2, 'xyz': 3}
a = pd.Series(d, name='Value')
df = pd.DataFrame(a)
But not directly like this:
d = {'abc': 1, 'def': 2, 'xyz': 3}
df = pd.DataFrame(d)
I'm aware of the from_dict method, and this also gives the desired result:
d = {'abc': 1, 'def': 2, 'xyz': 3}
pd.DataFrame.from_dict(d, orient='index')
but I don't see why:
(1) a separate method is needed to create a dataframe from dict when creating from series or list works without issue;
(2) how/why creating a dataframe from dict/list of lists works, but not creating from dict directly.
Have found several SE answers that offer solutions, but looking for the 'why' as this behavior seems inconsistent. Can anyone shed some light on what I may be missing here.

There's actually a lot happening here, so let's break it down.
The Problem
There are soooo many different ways to create a DataFrame (from a list of records, dict, csv, ndarray, etc ...) that even for python veterans it can take a long time to understand them all. Hell, within each of those ways, there are EVEN MORE ways to build a DataFrame by tweaking some parameters and whatnot.
For example, for dictionaries (where the values are equal length lists), here are two ways pandas can handle them:
Case 1:
You treat each key-value pair as a column title and its values at each row respectively. In this case, the rows don't have names, and so by default you might just name them by their row index.
Case 2:
You treat each key-value pair as the row's name and its values at each column respectively. In this case, the columns don't have names, and so by default you might just name them by their index.
The Solution
Python is a dynamically typed language (variables don't declare a type and functions don't declare a return type). As a result, it doesn't have function overloading. So, you basically have two philosophies when you want to create a class that can be constructed in multiple ways:
1. Create only one constructor that checks the input and handles it accordingly, covering all possible options. This can get very bloated and complicated when certain inputs have their own options/parameters and when there's simply too much variety.
2. Separate each option into @classmethods that handle each specific way of constructing the object.
The second is generally better, as it really enforces separation of concerns as a software engineering design principle; however, the user will need to know all the different @classmethod constructor calls as a result. Although, in my opinion, if your class is complicated enough to have many different construction options, the user should be aware of that anyway.
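To make the second philosophy concrete, here is a toy sketch of a default constructor plus one alternative @classmethod constructor. The Table class and its from_rows method are hypothetical names for illustration only; this is not how pandas is implemented.

```python
class Table:
    """Toy column store (hypothetical, for illustration only)."""

    def __init__(self, columns, index=None):
        # Default constructor: each key is a column name (Case 1).
        self.columns = {name: list(values) for name, values in columns.items()}
        n_rows = len(next(iter(self.columns.values()), []))
        self.index = list(index) if index is not None else list(range(n_rows))

    @classmethod
    def from_rows(cls, rows):
        # Alternative constructor: each key is a row name (Case 2).
        index = list(rows)
        width = len(next(iter(rows.values()), []))
        columns = {i: [rows[name][i] for name in index] for i in range(width)}
        return cls(columns, index=index)


by_column = Table({'Key': ['abc', 'def'], 'Value': [1, 2]})
by_row = Table.from_rows({'abc': [1, 10], 'def': [2, 20]})
```

The default constructor stays small because it only understands one layout; every other layout gets its own named entry point that converts to that layout and then delegates.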
The Pandas Way
Pandas adopts a sort of mix between the two solutions. It'll use the default behaviour for each input type, and if you want any extra functionality you'll need to use the respective @classmethod constructor.
For example, if you pass a dict into the DataFrame constructor, it will by default handle it as Case 1. If you want the second case, you'll need to use DataFrame.from_dict and pass in orient='index' (without orient='index', it would use the default behaviour described in Case 1).
In my opinion, I'm not a fan of this kind of implementation. Personally, it's more confusing than helpful. Honestly, a lot of pandas is designed like that. There's a reason why pandas is the topic of every other python tagged question on stackoverflow.

Data structures with Python

Python has a lot of convenient data structures (lists, tuples, dicts, sets, etc) which can be used to make other 'conventional' data structures (e.g., I can use a Python list to create a stack and a collections.deque to make a queue, dicts to make trees and graphs, etc).
There are even third-party data structures that can be used for specific tasks (for instance the structures in Pandas, pytables, etc).
So, if I know how to use lists, dicts, sets, etc, should I be able to implement any arbitrary data structure if I know what it is supposed to accomplish?
In other words, what kind of data structures can the Python data structures not be used for?
Thanks

For some simple data structures (eg. a stack), you can just use the builtin list to get your job done. With more complex structures (eg. a bloom filter), you'll have to implement them yourself using the primitives the language supports.
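As an illustration of building a richer structure from language primitives, here is a minimal Bloom filter sketch on top of bytearray and hashlib. The class name, the default sizes, and the choice of SHA-256 with a seed byte are all assumptions made for the example, not a tuned implementation.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        data = repr(item).encode('utf-8')
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], 'big') % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
seen.add("cat")
```

After this, "cat" in seen is True; a membership test for an item that was never added is usually False, but may occasionally be True by design.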
You should really use the builtins if they serve your purpose, since they have been debugged and optimised by a horde of people over a long time. Doing it from scratch by yourself will probably produce an inferior data structure. Whether you're using Python, C++, C#, Java, whatever, you should always look to the built-in data structures first. They will generally be implemented using the same system primitives you would have to use doing it yourself, but with the advantage of having been tried and tested.
Combinations of these data structures (and maybe some of the functions from helper modules such as heapq and bisect) are generally sufficient to implement most richer structures that may be needed in real-life programming; however, that's not invariably the case.
Only when the provided data structures do not allow you to accomplish what you need, and there isn't an alternative and reliable library available to you, should you be looking at building something from scratch (or extending what's provided).
Let's say that you need something more than the rich Python library provides. Consider the fact that an object's attributes (and items in collections) are essentially "pointers" to other objects (without pointer arithmetic), i.e., "reseatable references", in Python just like in Java. In Python, you normally use a None value in an attribute or item to represent what NULL would mean in C++ or null would mean in Java.
So, for example, you could implement binary trees via, e.g.:
class Node(object):
    __slots__ = 'data', 'left', 'right'
    def __init__(self, data=None, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right
plus methods or functions for traversal and similar operations (the __slots__ class attribute is optional -- mostly a memory optimization, to avoid each Node instance carrying its own __dict__, which would be substantially larger than the three needed attributes/references).
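The traversal functions mentioned above could look roughly like this. It is a sketch that assumes the tree is used as an unbalanced binary search tree; the insert and in_order names are made up for the example.

```python
class Node:
    __slots__ = ('data', 'left', 'right')

    def __init__(self, data=None, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right


def insert(root, value):
    """Insert value into the (unbalanced) search tree and return the root."""
    if root is None:
        return Node(value)
    if value < root.data:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root


def in_order(node):
    """Yield the stored values in ascending order."""
    if node is not None:
        yield from in_order(node.left)
        yield node.data
        yield from in_order(node.right)


root = None
for value in [5, 3, 8, 1, 4]:
    root = insert(root, value)
ordered = list(in_order(root))
```

None plays the role of the null pointer here: an empty subtree is simply an attribute set to None.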
Other examples of data structures that may best be represented by dedicated Python classes, rather than by direct composition of other existing Python structures, include tries (see e.g. here) and graphs (see e.g. here).

You can use the Python data structures to do anything you like. The entire programming language Lisp (now people use either Common Lisp or Scheme) is built around the linked list data structure, and Lisp programmers can build any data structure they choose.
That said, there are sometimes data structures for which the Python data structures are not the best option. For instance, if you want to build a splay tree, you should either roll your own or use an open-source project like pysplay. If the built-in data structures solve your problem, use them. Otherwise, look beyond the built-in data structures. As always, use the best tool for the job.

Given that all data structures exist in memory, and memory is effectively just a list (array)... there is no data structure that couldn't be expressed in terms of the basic Python data structures (with appropriate code to interact with them).

It is important to realize that Python can represent hierarchical structures which are combinations of list (or tuple) and dict. For example, list-of-dict or dict-of-list or dict-of-dict are common simple structures. These are so common, that in my own practice, I append the data type to the variable name, like 'parameters_lod'. But these can go on to arbitrary depth.
The _lod datatype can be easily converted into a pandas DataFrame, for example, or any database table. In fact, some realizations of big-data tables use the _lod structure, sometimes omitting the commas between each dict and omitting the surrounding list brackets []. This makes it easy to append to a file of such lines. AWS offers tables that are dict syntax.
A _lod can be easily converted to a _dod if there is a field that is unique and can be used to index the table. An important difference between _lod and _dod is that the _lod can have multiple entries for the same keyfield, whereas a dict is required to have only one. Thus, it is more general to start with the _lod as the primary basic table structure so duplicates are allowed until the table is inspected to combine those entries.
If the lod is turned into dod, it is preferred to keep the entire dict intact, and not remove the item that is used for the keyfield.
a_lod = [
{'k': 'sam', 'a': 1, 'b': 2, 'c': 3},
{'k': 'sue', 'a': 4, 'b': 5, 'c': 6},
{'k': 'joe', 'a': 7, 'b': 8, 'c': 9}
]
a_dod = {'sam': {'k': 'sam', 'a': 1, 'b': 2, 'c': 3},
'sue': {'k': 'sue', 'a': 4, 'b': 5, 'c': 6},
'joe': {'k': 'joe', 'a': 7, 'b': 8, 'c': 9}
}
Thus, the dict key is added but the records are unchanged. We find this is a good practice so the underlying dicts are unchanged.
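The conversion itself can be sketched in a couple of lines. The variable names follow the example above; the grouped variant is an added assumption for the case where the key field is not unique.

```python
from collections import defaultdict

a_lod = [
    {'k': 'sam', 'a': 1, 'b': 2, 'c': 3},
    {'k': 'sue', 'a': 4, 'b': 5, 'c': 6},
    {'k': 'joe', 'a': 7, 'b': 8, 'c': 9},
]

# Unique key field: one record per key, records left intact (not copied).
a_dod = {record['k']: record for record in a_lod}

# Possibly repeated key field: collect records per key instead of overwriting.
grouped = defaultdict(list)
for record in a_lod:
    grouped[record['k']].append(record)
```

Because the records are referenced rather than copied, a change made through a_dod['sue'] is also visible in a_lod.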
The pandas DataFrame.append() method is very slow (it was deprecated in pandas 1.4 and removed in pandas 2.0). Therefore, you should not construct a dataframe one record at a time using this syntax:
df = df.append(record)
Instead, build it as a lod and then convert to a df, as follows.
df = pd.DataFrame.from_dict(lod)
This is much faster, as the other method will get slower and slower as the df grows, because the whole df is copied each time.
It has become important in our development to emphasize the use of _lod and avoid field names in each record that are not consistent, so they can be easily converted to Dataframe. So we avoid using key fields in dicts like 'sam':(data) and use {'name':'sam', 'dataname': (arbitrary data)} instead.
The most elegant thing about python structures is the fact that the default is to work with references to the data rather than values. This must be understood because modifying data in a reference will modify the larger structure.
If you want to make a copy, you need to use .copy(), and sometimes copy.deepcopy(), or .copy(deep=True) when using pandas. Only then is the data structure copied; otherwise, a variable name is just a reference to the same data.
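A small sketch of the difference between a reference, a shallow copy, and a deep copy, using only the standard library (the variable names are illustrative):

```python
import copy

original = {'sam': {'a': [1, 2]}}

alias = original                 # another name for the same dict
shallow = original.copy()        # new outer dict, inner objects still shared
deep = copy.deepcopy(original)   # fully independent structure

original['sam']['a'].append(3)   # mutate nested data in place
original['sue'] = {'a': []}      # add a new top-level key

alias_sees_sue = 'sue' in alias
shallow_sees_sue = 'sue' in shallow
```

The alias sees both changes, the shallow copy sees only the nested mutation, and the deep copy sees neither.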
Further, we discourage using the dol structure, and instead prefer the lodol or dodol. This is because it is best to have each data item identified with a label, which also allows additional fields to be added.

Python: in-memory object database which supports indexing?

I'm doing some data munging which would be quite a bit simpler if I could stick a bunch of dictionaries in an in-memory database, then run simple queries against it.
For example, something like:
people = db([
{"name": "Joe", "age": 16},
{"name": "Jane", "favourite_color": "red"},
])
over_16 = people.filter(age__gt=16)
with_favourite_colors = people.filter(favourite_color__exists=True)
There are three confounding factors, though:
(1) Some of the values will be Python objects, and serializing them is out of the question (too slow, breaks identity). Of course, I could work around this (e.g., by storing all the items in a big list, then serializing their indexes in that list… But that could take a fair bit of fiddling).
(2) There will be thousands of items, and I will be running lookup-heavy operations (like graph traversals) against them, so it must be possible to perform efficient (i.e., indexed) queries.
(3) As in the example, the data is unstructured, so systems which require me to predefine a schema would be tricky.
So, does such a thing exist? Or will I need to kludge something together?

What about using an in-memory SQLite database via the sqlite3 standard library module, using the special value :memory: for the connection? If you don't want to write your own SQL statements, you can always use an ORM, like SQLAlchemy, to access an in-memory SQLite database.
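A minimal sketch of that suggestion with the standard library follows. The table name, column names, and index name are assumptions for the example. Note that it needs a fixed schema and stores only plain values, which are two of the constraints listed in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # database lives only in RAM
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, favourite_color TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Joe", 16, None), ("Jane", None, "red")],
)
conn.execute("CREATE INDEX idx_people_age ON people (age)")

over_15 = conn.execute(
    "SELECT name FROM people WHERE age > ?", (15,)
).fetchall()
with_colour = conn.execute(
    "SELECT name FROM people WHERE favourite_color IS NOT NULL"
).fetchall()
```

Missing fields are represented as NULL, so rows without an age simply never match a comparison on age.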
EDIT: I noticed you stated that the values may be Python objects, and also that you require avoiding serialization. Requiring arbitrary Python objects be stored in a database also necessitates serialization.
Can I propose a practical solution if you must keep those two requirements? Why not just use Python dictionaries as indices into your collection of Python dictionaries? It sounds like you will have idiosyncratic needs for building each of your indices; figure out what values you're going to query on, then write a function to generate an index for each. The possible values for one key in your list of dicts will be the keys for an index; the values of the index will be lists of dictionaries. Query the index by giving the value you're looking for as the key.
import collections
import itertools
def make_indices(dicts):
    color_index = collections.defaultdict(list)
    age_index = collections.defaultdict(list)
    for d in dicts:
        if 'favorite_color' in d:
            color_index[d['favorite_color']].append(d)
        if 'age' in d:
            age_index[d['age']].append(d)
    return color_index, age_index
def make_data_dicts():
    ...
data_dicts = make_data_dicts()
color_index, age_index = make_indices(data_dicts)
# Query for those with a favorite color: simply all the index values
with_color_dicts = list(
    itertools.chain.from_iterable(color_index.values()))
# Query for people over 16: k is the age, v the list of matching dicts
over_16 = list(
    itertools.chain.from_iterable(
        v for k, v in age_index.items() if k > 16)
)
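The same idea can be generalised so that one function builds an index for any field. This is a sketch; build_index and the sample records are hypothetical, and the range query still scans the index keys rather than using a sorted structure.

```python
from collections import defaultdict


def build_index(records, field):
    """Map each value of `field` to the list of records that carry it."""
    index = defaultdict(list)
    for record in records:
        if field in record:
            index[record[field]].append(record)
    return index


people = [
    {"name": "Joe", "age": 16},
    {"name": "Jane", "favourite_color": "red"},
    {"name": "Ann", "age": 30},
]

age_index = build_index(people, "age")
colour_index = build_index(people, "favourite_color")

# Exact match is a single dict lookup; .get avoids creating empty entries.
exactly_16 = age_index.get(16, [])
# A range query walks the distinct ages, not every record.
over_16 = [r for age, rs in age_index.items() if age > 16 for r in rs]
with_colour = [r for rs in colour_index.values() for r in rs]
```

The index stores references to the original dictionaries, so nothing is serialized and object identity is preserved.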

If the in memory database solution ends up being too much work, here is a method for filtering it yourself that you may find useful.
The get_filter function takes in arguments to define how you want to filter a dictionary, and returns a function that can be passed into the built in filter function to filter a list of dictionaries.
import operator
def get_filter(key, op=None, comp=None, inverse=False):
    # This will invert the boolean returned by the function 'op' if 'inverse == True'
    result = lambda x: not x if inverse else x
    if op is None:
        # Without any function, just see if the key is in the dictionary
        return lambda d: result(key in d)
    if comp is None:
        # If 'comp' is None, assume the function takes one argument
        return lambda d: result(op(d[key])) if key in d else False
    # Use 'comp' as the second argument to the function provided
    return lambda d: result(op(d[key], comp)) if key in d else False
people = [{'age': 16, 'name': 'Joe'}, {'name': 'Jane', 'favourite_color': 'red'}]
# In Python 3, filter() returns an iterator, so wrap it in list() to print the matches
print(list(filter(get_filter("age", operator.gt, 15), people)))
# [{'age': 16, 'name': 'Joe'}]
print(list(filter(get_filter("name", operator.eq, "Jane"), people)))
# [{'name': 'Jane', 'favourite_color': 'red'}]
print(list(filter(get_filter("favourite_color", inverse=True), people)))
# [{'age': 16, 'name': 'Joe'}]
This is pretty easily extensible to more complex filtering, for example to filter based on whether or not a value is matched by a regex:
import re
p = re.compile("[aeiou]{2}")  # matches two lowercase vowels in a row
print(list(filter(get_filter("name", p.search), people)))
# [{'age': 16, 'name': 'Joe'}]

The only solution I know is a package I stumbled across a few years ago on PyPI, PyDbLite. It's okay, but there are a few issues:
It still wants to serialize everything to disk, as a pickle file. But that was simple enough for me to rip out. (It's also unnecessary. If the objects inserted are serializable, so is the collection as a whole.)
The basic record type is a dictionary, into which it inserts its own metadata, two ints under keys __id__ and __version__.
The indexing is very simple, based only on the values in the record dictionary. If you want something more complicated, like an index based on an attribute of an object in the record, you'll have to code it yourself. (Something I've meant to do myself, but never got around to.)
The author does seem to be working on it occasionally. There are some new features since I used it, including some nice syntax for complex queries.
Assuming you rip out the pickling (and I can tell you what I did), your example would be (untested code):
from PyDbLite import Base
db = Base()
db.create("name", "age", "favourite_color")
# You can insert records as either named parameters
# or in the order of the fields
db.insert(name="Joe", age=16, favourite_color=None)
db.insert("Jane", None, "red")
# These should return an object you can iterate over
# to get the matching records. These are unindexed queries.
#
# The first might throw because of the None in the second record
over_16 = db("age") > 16
with_favourite_colors = db("favourite_color") != None
# Or you can make an index for faster queries
db.create_index("favourite_color")
with_favourite_color_red = db._favourite_color["red"]
Hopefully it will be enough to get you started.

As far as "identity" goes, you should be able to compare anything that is hashable in order to keep track of object identity.
Zope Object Database (ZODB):
http://www.zodb.org/
PyTables works well:
http://www.pytables.org/moin
Also Metakit for Python works well:
http://equi4.com/metakit/python.html
supports columns, and sub-columns but not unstructured data
Research "Stream Processing", if your data sets are extremely large this may be useful:
http://www.trinhhaianh.com/stream.py/
Any in-memory database, that can be serialized (written to disk) is going to have your identity problem. I would suggest representing the data you want to store as native types (list, dict) instead of objects if at all possible.
Keep in mind NumPy was designed to perform complex operations on in-memory data structures, and could possibly be a part of your solution if you decide to roll your own.

I wrote a simple module called Jsonstore that solves (2) and (3). Here's how your example would go:
from jsonstore import EntryManager
from jsonstore.operators import GreaterThan, Exists
db = EntryManager(':memory:')
db.create(name='Joe', age=16)
db.create({'name': 'Jane', 'favourite_color': 'red'}) # alternative syntax
db.search({'age': GreaterThan(16)})
db.search(favourite_color=Exists()) # again, 2 different syntaxes

Not sure if it complies with all your requirements, but TinyDB (using in-memory storage) is also probably worth the try:
>>> from tinydb import TinyDB, Query
>>> from tinydb.storages import MemoryStorage
>>> db = TinyDB(storage=MemoryStorage)
>>> db.insert({'name': 'John', 'age': 22})
>>> User = Query()
>>> db.search(User.name == 'John')
[{'name': 'John', 'age': 22}]
Its simplicity and powerful query engine makes it a very interesting tool for some use cases. See http://tinydb.readthedocs.io/ for more details.

If you are willing to work around serializing, MongoDB could work for you. PyMongo provides an interface almost identical to what you describe. If you decide to serialize, the hit won't be as bad since Mongodb is memory mapped.

It should be possible to do what you are wanting to do with just isinstance(), hasattr(), getattr() and setattr().
However, things are going to get fairly complicated before you are done!
I suppose one could store all the objects in a big list, then run a query on each object, determining what it is and looking for a given attribute or value, then return the value and the object as a list of tuples. Then you could sort on your return values pretty easily. copy.deepcopy will be your best friend and your worst enemy.
Sounds like fun! Good luck!
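A rough sketch of that brute-force approach, assuming the collection mixes plain dicts and ordinary objects. The helper names read_field and query, and the Person class, are made up for the example; every query is a linear scan, so it is not indexed.

```python
_MISSING = object()  # sentinel so that a stored None still counts as present


def read_field(obj, name):
    """Read `name` from a dict key or from an object attribute."""
    if isinstance(obj, dict):
        return obj.get(name, _MISSING)
    return getattr(obj, name, _MISSING)


def query(objects, name, predicate=lambda value: True):
    """Return (value, object) pairs for objects that have `name` and pass `predicate`."""
    hits = []
    for obj in objects:
        value = read_field(obj, name)
        if value is not _MISSING and predicate(value):
            hits.append((value, obj))
    return hits


class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age


items = [Person("Joe", 16), {"name": "Jane", "favourite_color": "red"}, Person("Ann", 30)]
adults = sorted(query(items, "age", lambda age: age >= 18), key=lambda pair: pair[0])
named = query(items, "name")
```

Returning the original objects rather than copies keeps identity intact, which sidesteps the need for copy.deepcopy in the simple case.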

I started developing one yesterday and it isn't published yet. It indexes your objects and allows you to run fast queries. All data is kept in RAM and I'm thinking about smart load and save methods. For testing purposes it is loading and saving through cPickle.
Let me know if you are still interested.

ducks is exactly what you are describing.
It builds indexes on Python objects
It does not serialize or persist anything
Missing attributes are handled correctly
It uses C libraries so it's very fast and RAM-efficient
pip install ducks
from ducks import Dex, ANY
objects = [
{"name": "Joe", "age": 16},
{"name": "Jane", "favourite_color": "red"},
]
# Build the index
dex = Dex(objects, ['name', 'age', 'favourite_color'])
# Look up by any combination of attributes
dex[{'age': {'>=': 16}}] # Returns Joe
# Match the special value ANY to find all objects with the attribute
dex[{'favourite_color': ANY}] # Returns Jane
This example uses dicts, but ducks works on any object type.

Python equivalent to java.util.SortedSet?

Does anybody know if Python has an equivalent to Java's SortedSet interface?
Here's what I'm looking for: let's say I have an object of type foo, and I know how to compare two objects of type foo to see whether foo1 is "greater than" or "less than" foo2. I want a way of storing many objects of type foo in a list L, so that whenever I traverse the list L, I get the objects in order, according to the comparison method I define.
Edit:
I guess I can use a dictionary or a list and sort() it every time I modify it, but is this the best way?

Take a look at BTrees. It looks like you need one of them. As far as I understand, you need a structure that supports relatively cheap insertion of elements and a cheap sorting operation (or no sorting at all). BTrees offer that.
I have experience with ZODB.BTrees, and they scale to thousands and millions of elements.

You can use insort from the bisect module to insert new elements efficiently in an already sorted list:
from bisect import insort
items = [1,5,7,9]
insort(items, 3)
insort(items, 10)
print(items)  # -> [1, 3, 5, 7, 9, 10]
Note that this does not directly correspond to SortedSet, because it uses a list. If you insert the same item more than once you will have duplicates in the list.
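To get closer to set semantics with the same module, one option is to check for an existing equal element before inserting. This is a sketch; insort_unique is a hypothetical helper, not part of bisect.

```python
from bisect import bisect_left


def insort_unique(items, value):
    """Insert value into the sorted list unless an equal element is present."""
    pos = bisect_left(items, value)
    if pos == len(items) or items[pos] != value:
        items.insert(pos, value)
        return True
    return False


items = [1, 5, 7, 9]
added_3 = insort_unique(items, 3)   # new element, inserted in order
added_5 = insort_unique(items, 5)   # already present, list unchanged
```

Finding the position is O(log n), but list.insert still shifts elements, so each insertion remains O(n) in the worst case.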

If you're looking for an implementation of an efficient container type for Python implemented using something like a balanced search tree (A Red-Black tree for example) then it's not part of the standard library.
I was able to find this, though:
http://www.brpreiss.com/books/opus7/
The source code is available here:
http://www.brpreiss.com/books/opus7/public/Opus7-1.0.tar.gz
I don't know how the source code is licensed, and I haven't used it myself, but it would be a good place to start looking if you're not interested in rolling your own container classes.
There's PyAVL which is a C module implementing an AVL tree.
Also, this thread might be useful to you. It contains a lot of suggestions on how to use the bisect module to enhance the existing Python dictionary to do what you're asking.
Of course, using insort() that way would be pretty expensive for insertion and deletion, so consider it carefully for your application. Implementing an appropriate data structure would probably be a better approach.
In any case, to understand whether you should keep the data structure sorted or sort it when you iterate over it you'll have to know whether you intend to insert a lot or iterate a lot. Keeping the data structure sorted makes sense if you modify its content relatively infrequently but iterate over it a lot. Conversely, if you insert and delete members all the time but iterate over the collection relatively infrequently, sorting the collection of keys before iterating will be faster. There is no one correct approach.

Similar to blist.sortedlist, the sortedcontainers module provides a sorted list, sorted set, and sorted dict data type. It uses a modified B-tree in the underlying implementation and is faster than blist in most cases.
The sortedcontainers module is pure-Python so installation is easy:
pip install sortedcontainers
Then for example:
from sortedcontainers import SortedList, SortedDict, SortedSet
help(SortedList)
The sortedcontainers module has 100% test coverage and hours of stress testing. There's a pretty comprehensive performance comparison that lists most of the options you'd consider for this.

If you only need the keys, and no associated value, Python offers sets:
s = set(a_list)
for k in sorted(s):
    print(k)
However, you'll be sorting the set each time you do this.
If that is too much overhead, you may want to look at heap queues (the heapq module). They may not be as elegant and "Pythonic", but maybe they suit your needs.
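A minimal sketch of the heap queue approach with the standard heapq module. A heap keeps only the smallest element readily available, so ordered traversal is done by popping, which consumes the heap.

```python
import heapq

heap = [7, 2, 9, 4]
heapq.heapify(heap)          # rearrange the list into heap order, in place
heapq.heappush(heap, 1)      # O(log n) insertion

smallest = heap[0]           # peek at the minimum without removing it
ordered = [heapq.heappop(heap) for _ in range(len(heap))]
```

For custom objects, heapq relies on the < operator, so the class needs to define __lt__ (or the heap can store (key, item) tuples).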

Use blist.sortedlist from the blist package.
from blist import sortedlist
z = sortedlist([2, 3, 5, 7, 11])
z.add(6)
z.add(3)
z.add(10)
print(z)
This will output:
sortedlist([2, 3, 3, 5, 6, 7, 10, 11])
The resulting object can be used just like a python list.
>>> len(z)
8
>>> [2 * x for x in z]
[4, 6, 6, 10, 12, 14, 20, 22]

Do you have the possibility of using Jython? I just mention it because using TreeMap, TreeSet, etc. is trivial. Also if you're coming from a Java background and you want to head in a Pythonic direction Jython is wonderful for making the transition easier. Though I recognise that use of TreeSet in this case would not be part of such a "transition".
For Jython superusers I have a question myself: the blist package can't be imported because it uses a C file which must be imported. But would there be any advantage of using blist instead of TreeSet? Can we generally assume the JVM uses algorithms which are essentially as good as those of CPython stuff?
