Python: in-memory object database which supports indexing?

I'm doing some data munging which would be quite a bit simpler if I could stick a bunch of dictionaries in an in-memory database, then run simple queries against it.
For example, something like:
people = db([
    {"name": "Joe", "age": 16},
    {"name": "Jane", "favourite_color": "red"},
])
over_16 = db.filter(age__gt=16)
with_favorite_colors = db.filter(favorite_color__exists=True)
There are three confounding factors, though:
Some of the values will be Python objects, and serializing them is out of the question (too slow, breaks identity). Of course, I could work around this (e.g., by storing all the items in a big list, then serializing their indexes in that list… but that could take a fair bit of fiddling).
There will be thousands of items, and I will be running lookup-heavy operations (like graph traversals) against them, so it must be possible to perform efficient (i.e., indexed) queries.
As in the example, the data is unstructured, so systems which require me to predefine a schema would be tricky.
So, does such a thing exist? Or will I need to kludge something together?

What about using an in-memory SQLite database via the sqlite3 standard library module, using the special value :memory: for the connection? If you don't want to write your own SQL statements, you can always use an ORM, like SQLAlchemy, to access an in-memory SQLite database.
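For concreteness, here is a minimal sketch of that approach; the table layout and column names are just assumptions that mirror the question's example, not anything the question prescribes:
import sqlite3

# Sketch only: the table layout mirrors the question's example data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, favourite_color TEXT)")
conn.executemany(
    "INSERT INTO people (name, age, favourite_color) VALUES (?, ?, ?)",
    [("Joe", 16, None), ("Jane", None, "red")],
)
conn.execute("CREATE INDEX idx_age ON people (age)")  # indexed lookups

over_16 = conn.execute("SELECT * FROM people WHERE age > ?", (16,)).fetchall()
with_favourite_colors = conn.execute(
    "SELECT * FROM people WHERE favourite_color IS NOT NULL").fetchall()
Note that this still serializes the values into SQLite rows, so it addresses the schema-free indexing requirement but not the identity one.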
EDIT: I noticed you stated that the values may be Python objects, and also that you require avoiding serialization. Requiring arbitrary Python objects be stored in a database also necessitates serialization.
Can I propose a practical solution if you must keep those two requirements? Why not just use Python dictionaries as indices into your collection of Python dictionaries? It sounds like you will have idiosyncratic needs for building each of your indices; figure out what values you're going to query on, then write a function to generate an index for each. The possible values for one key in your list of dicts will be the keys for an index; the values of the index will be lists of dictionaries. Query the index by giving the value you're looking for as the key.
import collections
import itertools

def make_indices(dicts):
    color_index = collections.defaultdict(list)
    age_index = collections.defaultdict(list)
    for d in dicts:
        if 'favorite_color' in d:
            color_index[d['favorite_color']].append(d)
        if 'age' in d:
            age_index[d['age']].append(d)
    return color_index, age_index

def make_data_dicts():
    ...

data_dicts = make_data_dicts()
color_index, age_index = make_indices(data_dicts)

# Query for those with a favorite color: simply the values of the color index
with_color_dicts = list(
    itertools.chain.from_iterable(color_index.values()))

# Query for people over 16
over_16 = list(
    itertools.chain.from_iterable(
        v for k, v in age_index.items() if k > 16)
)

If the in-memory database solution ends up being too much work, here is a method for filtering the data yourself that you may find useful.
The get_filter function takes in arguments to define how you want to filter a dictionary, and returns a function that can be passed into the built in filter function to filter a list of dictionaries.
import operator

def get_filter(key, op=None, comp=None, inverse=False):
    # This will invert the boolean returned by the function 'op' if 'inverse == True'
    result = lambda x: not x if inverse else x
    if op is None:
        # Without any function, just see if the key is in the dictionary
        return lambda d: result(key in d)
    if comp is None:
        # If 'comp' is None, assume the function takes one argument
        return lambda d: result(op(d[key])) if key in d else False
    # Use 'comp' as the second argument to the function provided
    return lambda d: result(op(d[key], comp)) if key in d else False
people = [{'age': 16, 'name': 'Joe'}, {'name': 'Jane', 'favourite_color': 'red'}]

print(list(filter(get_filter("age", operator.gt, 15), people)))
# [{'age': 16, 'name': 'Joe'}]
print(list(filter(get_filter("name", operator.eq, "Jane"), people)))
# [{'name': 'Jane', 'favourite_color': 'red'}]
print(list(filter(get_filter("favourite_color", inverse=True), people)))
# [{'age': 16, 'name': 'Joe'}]
This is pretty easily extensible to more complex filtering, for example to filter based on whether or not a value is matched by a regex:
import re

p = re.compile("[aeiou]{2}")  # matches two lowercase vowels in a row
print(list(filter(get_filter("name", p.search), people)))
# [{'age': 16, 'name': 'Joe'}]

The only solution I know is a package I stumbled across a few years ago on PyPI, PyDbLite. It's okay, but there are a few issues:
It still wants to serialize everything to disk, as a pickle file. But that was simple enough for me to rip out. (It's also unnecessary. If the objects inserted are serializable, so is the collection as a whole.)
The basic record type is a dictionary, into which it inserts its own metadata, two ints under keys __id__ and __version__.
The indexing is very simple, based only on the values in the record dictionary. If you want something more complicated, like an index based on an attribute of an object in the record, you'll have to code it yourself. (Something I've meant to do myself, but never got around to.)
The author does seem to be working on it occasionally. There are some new features since I last used it, including some nice syntax for complex queries.
Assuming you rip out the pickling (and I can tell you what I did), your example would be (untested code):
from PyDbLite import Base
db = Base()
db.create("name", "age", "favourite_color")
# You can insert records as either named parameters
# or in the order of the fields
db.insert(name="Joe", age=16, favourite_color=None)
db.insert("Jane", None, "red")
# These should return an object you can iterate over
# to get the matching records. These are unindexed queries.
#
# The first might throw because of the None in the second record
over_16 = db("age") > 16
with_favourite_colors = db("favourite_color") != None
# Or you can make an index for faster queries
db.create_index("favourite_color")
with_favourite_color_red = db._favourite_color["red"]
Hopefully it will be enough to get you started.

As far as "identity" anything that is hashable you should be able to compare, to keep track of object identity.
Zope Object Database (ZODB):
http://www.zodb.org/
PyTables works well:
http://www.pytables.org/moin
Also Metakit for Python works well:
http://equi4.com/metakit/python.html
It supports columns and sub-columns, but not unstructured data.
Research "stream processing"; if your data sets are extremely large, this may be useful:
http://www.trinhhaianh.com/stream.py/
Any in-memory database, that can be serialized (written to disk) is going to have your identity problem. I would suggest representing the data you want to store as native types (list, dict) instead of objects if at all possible.
Keep in mind NumPy was designed to perform complex operations on in-memory data structures, and could possibly be a part of your solution if you decide to roll your own.
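As a rough illustration of that roll-your-own direction, one could keep the original objects in a plain list and use a parallel NumPy array of one queryable field as a cheap index; the field names and the -1 sentinel below are assumptions made for the example:
import numpy as np

# Illustration only: the real objects stay in an ordinary list (nothing is
# serialized) and a parallel NumPy array of one field acts as the index.
records = [{"name": "Joe", "age": 16}, {"name": "Jane", "age": 20}]
ages = np.array([r.get("age", -1) for r in records])  # -1 marks "missing"

mask = ages > 16                                # vectorised comparison
over_16 = [records[i] for i in np.nonzero(mask)[0]]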

I wrote a simple module called Jsonstore that solves (2) and (3). Here's how your example would go:
from jsonstore import EntryManager
from jsonstore.operators import GreaterThan, Exists
db = EntryManager(':memory:')
db.create(name='Joe', age=16)
db.create({'name': 'Jane', 'favourite_color': 'red'}) # alternative syntax
db.search({'age': GreaterThan(16)})
db.search(favourite_color=Exists()) # again, 2 different syntaxes

Not sure if it complies with all your requirements, but TinyDB (using in-memory storage) is also probably worth a try:
>>> from tinydb import TinyDB, Query
>>> from tinydb.storages import MemoryStorage
>>> db = TinyDB(storage=MemoryStorage)
>>> db.insert({'name': 'John', 'age': 22})
>>> User = Query()
>>> db.search(User.name == 'John')
[{'name': 'John', 'age': 22}]
Its simplicity and powerful query engine make it a very interesting tool for some use cases. See http://tinydb.readthedocs.io/ for more details.

If you are willing to work around serializing, MongoDB could work for you. PyMongo provides an interface almost identical to what you describe. If you decide to serialize, the hit won't be as bad since MongoDB is memory-mapped.
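A hedged sketch of what the example queries might look like with PyMongo, assuming a local mongod is running and you accept BSON serialization of the values:
from pymongo import MongoClient

# Assumes a local mongod; documents are converted to BSON on insert, so this
# only fits if the no-serialization constraint can be relaxed.
client = MongoClient()
people = client.test_db.people

people.insert_many([
    {"name": "Joe", "age": 16},
    {"name": "Jane", "favourite_color": "red"},
])

over_16 = list(people.find({"age": {"$gt": 16}}))
with_favourite_colors = list(people.find({"favourite_color": {"$exists": True}}))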

It should be possible to do what you are wanting to do with just isinstance(), hasattr(), getattr() and setattr().
However, things are going to get fairly complicated before you are done!
I suppose one could store all the objects in a big list, then run a query on each object, determining what it is and looking for a given attribute or value, then return the value and the object as a list of tuples. Then you could sort on your return values pretty easily. copy.deepcopy will be your best friend and your worst enemy.
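A minimal sketch of that big-list-plus-inspection idea; the query helper and the Person class are made up purely for illustration:
# Hypothetical helper showing the "scan a big list with hasattr()/getattr()" idea;
# the names 'query' and 'Person' are invented for the example.
def query(objects, attr, predicate):
    # Return (value, object) tuples for objects that have 'attr' and whose
    # value satisfies 'predicate'.
    return [(getattr(obj, attr), obj)
            for obj in objects
            if hasattr(obj, attr) and predicate(getattr(obj, attr))]

class Person:
    def __init__(self, **fields):
        self.__dict__.update(fields)

people = [Person(name="Joe", age=16), Person(name="Jane", favourite_color="red")]
over_15 = query(people, "age", lambda v: v > 15)   # [(16, <Person object>)]
over_15.sort(key=lambda pair: pair[0])             # sort on the returned value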
Sounds like fun! Good luck!

I started developing one yesterday and it isn't published yet. It indexes your objects and allows you to run fast queries. All data is kept in RAM and I'm thinking about smart load and save methods. For testing purposes it is loading and saving through cPickle.
Let me know if you are still interested.

ducks is exactly what you are describing.
It builds indexes on Python objects
It does not serialize or persist anything
Missing attributes are handled correctly
It uses C libraries so it's very fast and RAM-efficient
pip install ducks
from ducks import Dex, ANY

objects = [
    {"name": "Joe", "age": 16},
    {"name": "Jane", "favourite_color": "red"},
]

# Build the index
dex = Dex(objects, ['name', 'age', 'favourite_color'])

# Look up by any combination of attributes
dex[{'age': {'>=': 16}}]  # Returns Joe

# Match the special value ANY to find all objects with the attribute
dex[{'favourite_color': ANY}]  # Returns Jane
This example uses dicts, but ducks works on any object type.
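For example, here is a sketch with a frozen dataclass instead of dicts, assuming Dex resolves attributes on non-dict objects the same way it resolves dict keys, as the ducks documentation describes; the Person class and its fields are illustrative:
from dataclasses import dataclass
from ducks import Dex

# Sketch assuming Dex reads attributes from non-dict objects the same way it
# reads keys from dicts; the class and field names are invented for the example.
@dataclass(frozen=True)
class Person:
    name: str
    age: int

people = [Person("Joe", 16), Person("Jane", 15)]
dex = Dex(people, ["age"])
dex[{"age": {">=": 16}}]   # returns the Person objects themselves, no copies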

Related

Python switch statement with aliases

How do I write an efficient "switch" statement that can return the same thing for different inputs?
A simple switch in Python can be implemented using a dictionary like this:
def switch(s):
    case = {'phone': '123 456 789', 'website': 'www.example.com'}
    return case[s]
This one has constant access time; however, I want to use aliases, i.e. switch('website') should return the same thing as switch('site') etc., without duplicating values, i.e. without using
case = {'website': 'www.example.com', 'site': 'www.example.com'}
What can be used is:
def switch(s):
    case = {('telephone', 'number', 'phone'): '123 456 789',
            ('website', 'site'): 'www.example.com'}
    for key, value in case.items():
        if s in key:
            return value
But this approach has linear, rather than constant, access time.
It can be made constant by using
def switch(s):
    case = ['123 456 789', 'www.example.com']
    aliases = {'telephone': 0, 'number': 0, 'phone': 0,
               'website': 1, 'site': 1}
    return case[aliases[s]]
but then I'm sort of duplicating values, and if I decide to remove any answer I have to edit aliases' and/or case's return values (if I no longer want to return '123 456 789', I have to delete it from case and modify aliases so that aliases['website'] and aliases['site'] return 0, OR leave a dummy value in case's first cell, OR make case a dictionary).
Is there a better way to write such statements?
You can use the linked hashmaps approach:
def switch(s):
    alias = {'telephone': 1, 'number': 1, 'phone': 1,
             'website': 2, 'site': 2}
    case = {1: '123 456 789', 2: 'www.example.com'}
    return case[alias[s]]
That way you are keeping the O(1) lookup time.
Of course, for real data, you'll want to automate the construction of alias and case maps, but that should be rather straightforward.
Updates/deletes should also be rather simple, since they come down to simple dict update/delete.
Also, to make insertion of new values easier, you can use UUID4 (or some other random value) instead of numbers.
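A small sketch of that idea; the add_entry helper is hypothetical and just shows the alias -> random key -> value layout:
import uuid

# Hypothetical sketch of the alias -> random key -> value layout; the
# add_entry helper is invented for the example.
case = {}
aliases = {}

def add_entry(names, value):
    key = uuid.uuid4()            # random key, so nothing needs renumbering later
    case[key] = value
    for name in names:
        aliases[name] = key

def switch(s):
    return case[aliases[s]]

add_entry(['telephone', 'number', 'phone'], '123 456 789')
add_entry(['website', 'site'], 'www.example.com')
switch('site')    # 'www.example.com'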
I would simply use an aliases dictionary (without identity aliases) alongside your original case dictionary and check for potential aliases using get:
def switch(s):
    case = {'phone': '123 456 789', 'website': 'www.example.com'}
    aliases = {'telephone': 'phone', 'number': 'phone', 'site': 'website'}
    return case[aliases.get(s, s)]  # check if it's an alias or use the input as-is
That way you don't need to duplicate the values (not in case and not in alias).
In your question you say:
I want to use aliases, i.e. switch('website') will return the same thing as switch('site') etc. without duplicating values
I think your concern about duplicated values is misplaced and you shouldn't reject that approach. Adding an extra dictionary entry with the same string value should not be a problem, and it's the natural way to solve your issue. Don't complicate your code with an extra indirection layer if you don't need to.
I'm assuming your concern with that approach is that it could increase your memory usage, since identical values are stored several times in the dictionary. But most of the time, you won't have multiple separate identical strings; rather, you'll have multiple references to the same string object. Since strings are immutable, Python may substitute in references to preexisting objects when it would appear it should be creating another independent string with the same contents.
You can test this for yourself. Try creating a dictionary with several identical string literals as values, then test the id of each one:
d = {"a": "foo", "b": "foo", "c": "foo"}
for val in d.values():
print(id(val))
On my system this tells me the ids are all the same. I think that multiple identical string literals that are compiled at the same time will always be turned into multiple references to a single string object. In some situations, thanks to string "interning", all strings with certain contents (generally things that look like they could be identifiers) will be shared everywhere in the program. But you probably don't need to care too much about the details. The important thing to realize is that the duplicated strings probably won't use an excessive amount of memory most of the time.
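If you want to see the difference yourself, here is a small check; the strings are built at run time so the result doesn't depend on literal pooling (the exact behaviour is a CPython implementation detail):
import sys

# Strings built at run time are generally not interned automatically, so these
# two are normally distinct objects with equal contents.
a = "".join(["hello", " ", "world"])
b = "".join(["hello", " ", "world"])
print(a == b, a is b)                    # True False (typically, on CPython)

# Explicit interning makes equal strings share a single object.
print(sys.intern(a) is sys.intern(b))    # True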
I can't think of any other reason to object to adding all the aliases to a single dictionary. That's the natural solution, so I'd just do it. If memory usage turns out to be an issue later, you might revisit the dictionary to double check that it's being populated with repeated references, not duplicate objects, but I doubt it will matter on the scale of any serious program.
Having code that is easy to use and understand is much more important.
As you've commented that your main concern is not repeating yourself, you might want to set up the dictionary using code to transform another slightly less redundant data structure, rather than doing it directly as a literal.
For instance, the following code uses a dictionary comprehension to turn a list that pairs up sublists of aliases with their values into an easily searchable dictionary:
_data = [  # contains (alias_list, value) 2-tuples
    (['telephone', 'number', 'phone'], '123 456 789'),
    (['website', 'site'], 'www.example.com'),
]

case = {alias: value for aliases, value in _data for alias in aliases}
You probably want to put this code somewhere where it will only run once (e.g. at the top level, or in a class or instance variable somewhere), rather than having the dictionary comprehension run every time your switch function is called. Because the dictionary is mutable, Python won't assume it can use the same dict object for each call (even though it always has the same value).

Trying to remove duplicates from large list of objects, keep certain one

I have a large list of objects in Python that I'm storing in a text file (for lack of knowledge of how to use any other database for the present).
Currently there are 40,000, but I expect the list eventually may exceed 1,000,000 entries. I'm trying to remove duplicates, where duplicates are defined as different objects having the same value for a text string attribute, but keep the most recent version of that object (defined as having the highest value in another attribute).
What I want to make is a function that returns only objects 2 and 3 from the following list, reliably:
Object 1: text="hello" ID=1
Object 2: text="hello" ID=2
Object 3: text="something else" ID=3
Doing this manually (looping through the list for each object) is already too slow, and since it's O(n²) it will only get slower, so I need a smarter way to do it. I have seen hashing the objects and using the set function recommended multiple times, but I have two questions about this that I haven't found satisfactory answers to:
How does hashing improve the efficiency to the degree it does?
How can I do this and retain only the most recent such object? The examples I have seen all use the set function and I'm not sure how that would return only the most recent one.
EDIT: I can probably find good answers to question 1 elsewhere, but I am still stuck on question 2. To take another stab at explaining it: hashing the objects above on their text and using the set function will return a set where the object kept from each group of duplicates is chosen arbitrarily (e.g., above, either a set of (Object 2, Object 3) or (Object 1, Object 3) could be returned; I need (Object 2, Object 3)).
Change to using a database ...
import sqlite3

db = sqlite3.connect("my.db")
db.execute("CREATE TABLE IF NOT EXISTS my_items (text PRIMARY KEY, id INTEGER);")
my_list_of_items = [("test", 1), ("test", 2), ("asdasd", 3)]
db.executemany("INSERT OR REPLACE INTO my_items (text, id) VALUES (?, ?)", my_list_of_items)
db.commit()
print(db.execute("SELECT * FROM my_items").fetchall())
This may have marginally higher overhead in terms of time ... but you will save on RAM.
Could use a dict with the text as key and the newest object for each key as value.
Setting up some demo data:
>>> from collections import namedtuple
>>> Object = namedtuple('Object', 'text ID')
>>> objects = Object('foo', 1), Object('foo', 2), Object('bar', 4), Object('bar', 3)
Solution:
>>> unique = {}
>>> for obj in objects:
...     if obj.text not in unique or obj.ID > unique[obj.text].ID:
...         unique[obj.text] = obj
...
>>> list(unique.values())
[Object(text='foo', ID=2), Object(text='bar', ID=4)]
Hashing is a well-researched subject in Computer Science. One of the standard uses is for implementing what Python calls a dictionary. (Perl calls the same thing a hash, for some reason. ;-) )
The idea is that for some key, such as a string, you can compute a simple numeric function - the hash value - and use that number as a quick way to look up the associated value stored in the dictionary.
Python has the built-in function hash() that returns the standard computation of this value. It also supports the __hash__() method, for objects that wish to compute their own hash value.
In a "normal" scenario, one way to determine if you have seen a field value before would be to use the field value as part of a dictionary. For example, you might store a dictionary that maps the field in question to the entire record, or a list of records that all share the same field value.
In your case, your data is too big (according to you), so that would be a bad idea. Instead, you might try something like this:
seen_before = {}  # Empty dictionary to start with.
while ... something:
    info = read_next_record()       # You figure this out.
    fld = info.fields[some_field]   # The value you care about
    hv = hash(fld)                  # Compute hash value for field.
    if hv in seen_before:
        print("This field value has been seen before")
    else:
        seen_before[hv] = True      # Never seen ... until NOW!

Storing two linked values in Python

I understand there's different ways of storing data in Python but I can't figure what to use for my needs.
I've made a small client/server game, and I want the amount of guesses it took them to be their score. I would then like to write their name (currently the IP address) along with the score into a file as to create a list of high scores. While I can do that perfectly fine, I only want a maximum of 5 scores stored and to be able to sort them so that when I display the high scores and names to the user, the lowest (being the best score) at the top. I'd also like to allow the username to exist more than once.
While it's easy to write the data and read it, I really can't figure out what data type to use. A dictionary would make a lot of sense in some cases, but a key can only have one value and can only exist once; a list has no relation to other specific values contained within, so neither makes sense to use; and tuples can't be sorted either, it seems.
I was thinking about reading each line into a separate list and then using the index to compare the scores so I could sort them and write them back to the file, but this would be bad on memory, in my opinion?
What would be the easiest method to save the name and score together without using some extreme learning curve like SQL?
A format with a pretty small learning curve would be json.
It's basically a dictionary where each value can be a number, string, boolean, float, array, or another dictionary (there may be more types).
>>> d = {}
>>> d['someId'] = {'somekey': [0,2,3]}
>>> d
{'someId': {'somekey': [0, 2, 3]}}
>>> d['someId']['somekey']
[0, 2, 3]
>>> d['someId']['number'] = 23456776543
>>> d['someId']['number']
23456776543
Once you get the hang of json consider an ODM / ORM for mongo.
Or for the time being, add a helper function to sort by score pulling in the score and name.
>>> d = {'user1': {'name': 'howdy', 'score': 11}, 'user2': {'name': 'howdy2', 'score': 12}}
>>> d.keys()
['user2', 'user1']
>>> for user_id in d.keys():
...     print(d[user_id]['name'], d[user_id]['score'])
...
('howdy2', 12)
('howdy', 11)
If it's necessary to set up relational data and models, then consider sqlite, as it has the basics without being as complicated to set up as postgres or mysql.
Example json reading:
Parsing values from a JSON file using Python?
Example json writing:
How do I write JSON data to a file in Python?
You can store a list of tuples, with each tuple containing the data in a particular order (e.g. score, name, etc.). Then you can sort the list using list.sort(), passing a key function that extracts the score.
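A minimal sketch of that list-of-tuples idea, keeping only the best five scores with the lowest (best) first; the helper name and sample data are made up for the example:
# Sketch only: add_score and the sample addresses are invented for illustration.
def add_score(scores, guesses, name, keep=5):
    scores.append((guesses, name))
    scores.sort(key=lambda pair: pair[0])   # lowest number of guesses first
    del scores[keep:]                        # drop everything beyond the top 'keep'

scores = []
add_score(scores, 7, "192.168.0.2")
add_score(scores, 4, "192.168.0.3")
add_score(scores, 9, "192.168.0.2")          # the same name can appear more than once
# scores == [(4, '192.168.0.3'), (7, '192.168.0.2'), (9, '192.168.0.2')]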

Pymongo or Mongodb is treating two equal python dictionaries as different objects. Can I force them to be treated the same?

Please look at the following lines of code and the results:
import pymongo
d1 = {'p': 0.5, 'theta': 100, 'sigma': 20}
d2 = {'theta': 100, 'sigma': 20, 'p': 0.5}
I get the following results:
d1 == d2                                  # Returns True
collectn.find({'goods.H': d1}).count()    # Returns 33
collectn.find({'goods.H': d2}).count()    # Returns 2
where collectn is a MongoDB collection object.
Is there a setting or a way to query so that I obtain the same results for the above two queries? They are essentially using the same dictionary (in the sense of d1 == d2 being True). I am trying to do the following: before inserting a record into the database, I check whether there already exists a record with the exact value combination that is being added. If so, then I don't want to make a new record. But because of the behavior shown above, the check can report that the record does not exist even when it does, and a duplicate record is added to the database (of course, with a different _id, but all other values are the same, and I would prefer not to have that).
Thank you in advance for your help.
The issue you are having is explained in the mongodb documentation here. It also has to do with the fact that Python dictionaries are unordered and MongoDB objects are ordered BSON objects.
The relevant quote being,
Equality matches within subdocuments select documents if the
subdocument matches exactly the specified subdocument, including the
field order.
I think you might be better off if you treat all three properties as subproperties of the main object instead of one collection of properties that is the subobject. That way the ordering of the subobject is not forced into the query by the python interpreter.
For instance...
d1 = {'goods.H.p': 0.5, 'goods.H.theta': 100, 'goods.H.sigma': 20}
d2 = {'goods.H.theta': 100, 'goods.H.sigma': 20, 'goods.H.p': 0.5}
collectn.find(d1).count()
collectn.find(d2).count()
...may yield more consistent results.
Finally, a way to do it changing less code:
collectn.find({'goods.H.' + k:v for k,v in d1.items()})
collectn.find({'goods.H.' + k:v for k,v in d2.items()})
I can only think of two things to do:
Structure your query like this: collectn.find({'goods.H.p': 0.5, 'goods.H.theta': 100, 'goods.H.sigma': 20}).count(). That will find the correct number of documents.
Restructure your data -> if you look at MongoDB : Indexes order and query order must match? you will see that you can index on p, sigma, theta so that any order of the terms in the query will provide the correct result. In my brief tests (I am no expert) I was not able to index in a way that produces that same effect with your current structure.
I think your problem is mentioned in the MongoDB documentation:
The field must match the sub-document exactly, including order....
Look at the documentation here.
There is an example with a sub-document.
Fields in the sub-document have to be in the same order as in the query to be matched.
I think you're looking for the $where operator.
This works in Node:
var myCursor = coll.find({$where: function () {return obj.goods.H == d1}});
myCursor.count(function (err, myCount) {console.log(myCount)});
In Python I believe you'll need to pass in a BSON code object.
The documentation warns that the $where operator should be used as a last resort since it comes with a performance penalty, and can't use indexes.
It seems like it may be worthwhile to establish an ordering of the sub properties, and enforce it if possible on insert or as a post process.
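One way to enforce such an ordering is with SON from the bson package that ships with PyMongo; this is only a sketch, with the field order and the ordered() helper chosen for illustration, and collectn being the collection object from the question:
from bson.son import SON   # bson ships with PyMongo

# Sketch only: FIELD_ORDER and ordered() are illustrative; collectn is the
# collection object from the question.
FIELD_ORDER = ['p', 'theta', 'sigma']

def ordered(subdoc):
    return SON((k, subdoc[k]) for k in FIELD_ORDER if k in subdoc)

d1 = {'p': 0.5, 'theta': 100, 'sigma': 20}
d2 = {'theta': 100, 'sigma': 20, 'p': 0.5}

# Both queries now serialize the sub-document with the same key order:
collectn.find({'goods.H': ordered(d1)}).count()
collectn.find({'goods.H': ordered(d2)}).count()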

Appending associative arrays in Python

I've been all over looking for a solution to this recently, and so far to no avail. I'm coming from PHP to Python, and running into an associative array difference I'm not sure how to overcome.
Take this line:
data[user]={i:{'item':row[0],'time':row[1]}}
This overwrites each of my data[user] entries, obviously, as it's not appending, it's just replacing the data each time.
In php if I wanted to append a new bit of data in a for loop, I could do
data[user][i][]=array('item'=>'x','time'=>'y'); // crude example
In python, I can't do:
data[user][]={i:{'item':row[0],'time':row[1]}}
It barfs on my []
I also can't do:
data[user][i]={'item':row[0],'time':row[1]}
where i is my iterator through the loop… and I think it's because data[user] hasn't been defined yet as of that operation? I've created data = {}, but I don't have it populated with the users as keys yet.
In Python, do I have to have a key defined before I can assign to it, including a sub-key?
I've tried a bunch of .append() options, and other weird tricks, but I want to know the correct method of doing this.
I can do:
data[user,i]={'item':row[0],'time':row[1]}
but this isn't what I want.
What's my proper method here, python friends?
Something like this should work. Note that unlike PHP, Python has separate primitive types for numeric arrays (called "lists") and associative arrays (called "dictionaries" or "dicts").
if user not in data:
    data[user] = []
data[user].append({'item': row[0], 'time': row[1]})
import collections

data = collections.defaultdict(list)
for user, row, time in input:
    data[user].append({'row': row, 'time': time})
You can also use setdefault() and do something like:
userData = data.setdefault(user, {})
userData[i] = {'item': row[0], 'time': row[1]}
If data[user] is a list, you can append to it with data[user].append({'item':row[0],'time':row[1]}).
