Relational data structure in Python

I'm looking for a SQL-relational-table-like data structure in python, or some hints for implementing one if none already exist. Conceptually, the data structure is a set of objects (any objects), which supports efficient lookups/filtering (possibly using SQL-like indexing).
For example, lets say my objects all have properties A, B, and C, which I need to filter by, hence I define the data should be indexed by them. The objects may contain lots of other members, which are not used for filtering. The data structure should support operations equivalent to SELECT <obj> from <DATASTRUCTURE> where A=100 (same for B and C). It should also be possible to filter by more than one field (where A=100 and B='bar').
The requirements are:
Should support a large number of items (~200K). The items must be the objects themselves, and not some flattened version of them (which rules out sqlite and likely pandas).
Insertion should be fast, should avoid reallocation of memory (which pretty much rules out pandas)
Should support simple filtering (like the example above), which must be more efficient than O(len(DATA)), i.e. avoid "full table scans".
Does such data structure exist?
Please don't suggest using sqlite. I'd need to repeatedly convert object->row and row->object, which is time consuming and cumbersome since my objects are not necessarily flat-ish.
Also, please don't suggest using pandas because repeated insertions of rows is too slow as it may requires frequent reallocation.

So long as you don't have any duplicates on (a,b,c) you could sub-class dict, enter your objects indexed by the tuple(a,b,c), and define your filter method (probably a generator) to return all entries that match your criteria.
class mydict(dict):
    def filter(self, a=None, b=None, c=None):
        # each key is the (a, b, c) tuple; iterate over (key, object) pairs
        for key, obj in self.items():
            if a is None or key[0] == a:
                if b is None or key[1] == b:
                    if c is None or key[2] == c:
                        yield obj
that is an ugly and very inefficient example, but you get the idea. I'm sure there is a better implementation method in itertools, or something.
edit:
I kept thinking about this. I toyed around with it some last night and came up with storing the objects in a list and storing dictionaries of the indexes by the desired keyfields. Retrieve objects by taking the intersection of the indexes for all specified criteria. Like this:
objs = []
aindex = {}
bindex = {}
cindex = {}

def insertobj(a, b, c, obj):
    idx = len(objs)
    objs.append(obj)
    if a in aindex:
        aindex[a].append(idx)
    else:
        aindex[a] = [idx]
    if b in bindex:
        bindex[b].append(idx)
    else:
        bindex[b] = [idx]
    if c in cindex:
        cindex[c].append(idx)
    else:
        cindex[c] = [idx]

def filterobjs(a=None, b=None, c=None):
    if a: aset = set(aindex[a])
    if b: bset = set(bindex[b])
    if c: cset = set(cindex[c])
    result = set(range(len(objs)))
    if a and aset: result = result.intersection(aset)
    if b and bset: result = result.intersection(bset)
    if c and cset: result = result.intersection(cset)
    for idx in result:
        yield objs[idx]

class testobj(object):
    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

    def show(self):
        print('a=%i\tb=%i\tc=%s' % (self.a, self.b, self.c))

if __name__ == '__main__':
    for a in range(20):
        for b in range(5):
            for c in ['one', 'two', 'three', 'four']:
                insertobj(a, b, c, testobj(a, b, c))
    for obj in filterobjs(a=5):
        obj.show()
    print()
    for obj in filterobjs(b=3):
        obj.show()
    print()
    for obj in filterobjs(a=8, c='one'):
        obj.show()
It should be reasonably quick: although the objects live in a list, they are accessed directly by index, and the "searching" is done against hashed dicts.
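The insertion bookkeeping above can be tightened with collections.defaultdict, and testing `is not None` also makes filtering on falsy keys like 0 work; a sketch along the same lines:

```python
from collections import defaultdict

objs = []
aindex = defaultdict(list)
bindex = defaultdict(list)
cindex = defaultdict(list)

def insertobj(a, b, c, obj):
    # record the object and index its position under each key field
    idx = len(objs)
    objs.append(obj)
    aindex[a].append(idx)   # defaultdict creates the empty list on first use
    bindex[b].append(idx)
    cindex[c].append(idx)

def filterobjs(a=None, b=None, c=None):
    # start from all row indices, then intersect with each requested index
    result = set(range(len(objs)))
    if a is not None:
        result &= set(aindex.get(a, ()))
    if b is not None:
        result &= set(bindex.get(b, ()))
    if c is not None:
        result &= set(cindex.get(c, ()))
    for idx in result:
        yield objs[idx]

insertobj(0, 'x', 'z', 'obj0')
insertobj(0, 'y', 'z', 'obj1')
insertobj(1, 'x', 'z', 'obj2')
assert sorted(filterobjs(a=0)) == ['obj0', 'obj1']
assert list(filterobjs(a=0, b='y')) == ['obj1']
```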

Related

Trouble with Python objects: it was the same object, but now it's different

You can see that I've created two instances of class A. So a.d (the dict of the first instance) and b.d (the dict of the second instance) should be different! But they are not: we can clearly see that a.d == b.d is True. Okay, so that should mean that if I modify a.d, then b.d will be modified too, right? No, now they are different. And you will say, okay, they just compare equal by value, not by reference. But I have another problem with this code:
class cluster(object):
    def __init__(self, id, points):
        self.id = id
        self.points = points
        self.distances = dict()  # maps ["other_cluster"] = distance_to_other_cluster

    def set_distance_to_cluster(self, other_cluster, distance):
        """Sets distance to itself and to other cluster"""
        assert self.distances != other_cluster.distances
        self.distances[other_cluster] = distance
        other_cluster.distances[self] = distance
and at the end I'm getting the same "distances" dict object for all clusters. Am I doing something wrong?
Dictionaries and lists are not simple value types; they are shared by reference, so you have to be very careful when working with them.
a = [5]
b = a
b.append(3)
You'd think that after this code a == [5] and b == [5, 3], but in reality they BOTH equal [5, 3]: b is just another name for the same list object.
And by the way, when you assigned the value a.d["A"] = 1 you turned the ARRAY into a DICTIONARY; dictionaries don't have the problem above, so it didn't come up again.
The solution is to use dictionaries from the start, since they suit your data type anyway.
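A minimal sketch of the aliasing described above, plus the usual fixes (copying, or creating the mutable object inside __init__); the Cluster class here is illustrative, not the poster's code:

```python
import copy

# Two names, one object: mutating through either name is visible through both.
a = [5]
b = a
b.append(3)
assert a is b
assert a == [5, 3] and b == [5, 3]

# An independent copy breaks the aliasing.
c = copy.copy(a)      # shallow copy; use copy.deepcopy for nested structures
c.append(7)
assert a == [5, 3]
assert c == [5, 3, 7]

# For per-instance dicts, create the dict inside __init__ rather than
# sharing one dict between instances (e.g. via a class attribute or a
# mutable default argument).
class Cluster:
    def __init__(self):
        self.distances = {}   # a fresh dict for every instance

x, y = Cluster(), Cluster()
assert x.distances is not y.distances
```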

python: return an existing object rather than creating a new object conditionally

My specific situation is as follows: I have an object that takes some arguments, say a, b, c, and d. What I want to happen when I create a new instance of this object is that it checks in a dictionary for the tuple (a,b,c,d), and if this key exists then it returns an existing instance created with arguments a, b, c and d. Otherwise, it will create a new one with arguments a, b, c and d, add it to the dictionary with the key (a,b,c,d), and then return this object.
The code for this isn't complicated, but I don't know where to put it - clearly it can't go in the __init__ method, because assigning to self won't change it, and at this point the new instance has already been made. The problem is that I simply don't know enough about the creation of object instances, and how to do something other than create a new one.
The purpose is to prevent redundancy to save memory in my case; a lot of objects will be made, many of which should be identical because they have the same arguments. They will be immutable, so there would be no danger in changing one of them and affecting the rest. If anyone can give me a way of implementing this, or indeed has a better way than what I have asked that solves the problem, I would appreciate it.
The class is something like:
class X:
    dct = {}

    def __init__(self, a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d
and somewhere I need the code:
if (a, b, c, d) in X.dct:
    return X.dct[(a, b, c, d)]
else:
    obj = X(a, b, c, d)
    X.dct[(a, b, c, d)] = obj
    return obj
and I want this code to run when I do something like:
x = X(a,b,c,d)
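No answer to this question appears in this excerpt, but the standard hook for returning an existing instance is __new__, which runs before __init__ and may return a cached object. A minimal sketch of that approach (the _cache name and structure are my own, not from the post):

```python
class X:
    _cache = {}

    def __new__(cls, a, b, c, d):
        key = (a, b, c, d)
        # Return the cached instance if one was already built for these arguments.
        if key in cls._cache:
            return cls._cache[key]
        obj = super().__new__(cls)
        cls._cache[key] = obj
        return obj

    def __init__(self, a, b, c, d):
        # Note: __init__ still runs even when __new__ returned a cached
        # instance; since the objects are immutable-by-convention, it just
        # reassigns the same values.
        self.a, self.b, self.c, self.d = a, b, c, d

x = X(1, 2, 3, 4)
y = X(1, 2, 3, 4)
assert x is y          # the same object, not merely equal
z = X(9, 2, 3, 4)
assert z is not x
```

If re-running __init__ matters (e.g. it has side effects), a classmethod factory or functools.lru_cache around a factory function avoids that wrinkle.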

cython: reducing the size of a class, reduce memory use, improve speed [closed]

Closed 7 years ago. This question is not accepting answers because it needs to be more focused.
I have a relatively simple problem: given a position in the genome, return the name of the gene at that point.
The way I am solving this problem right now is with the following class in Cython:
class BedFile():
    """ A Lookup Object """

    def __init__(self, bedfile):
        self.dict = {}
        cdef int start, end
        with open(bedfile) as infile:
            for line in infile:
                f = line.rstrip().split('\t')
                if len(f) < 4:
                    continue
                chr = f[0]
                start = int(f[1])
                end = int(f[2])
                gene = f[3]
                if chr not in self.dict:
                    self.dict[chr] = {}
                self.dict[chr][gene] = (start, end)

    def lookup(self, chromosome, location):
        """ Lookup your gene. Returns the gene name """
        cdef int l
        l = int(location)
        answer = ''
        for k, v in self.dict[chromosome].items():
            if v[0] < l < v[1]:
                answer = k
                break
        if answer:
            return answer
        else:
            return None
The full project is here: https://github.com/MikeDacre/python_bed_lookup, although the entire relevant class is above.
The issue with the code as-is is that the resulting class/dictionary takes up a very large amount of memory for the human genome, with 110 million genes (that's a 110-million-line text file). I killed the init function while it was building the dictionary, after about two minutes, when it hit 16GB of memory. Anything that uses that much memory is basically useless.
I am sure there must be a more efficient way of doing this, but I am not a C programmer, and I am very new to Cython. My guess is that I could build a pure C structure of some kind to hold the gene name and the start and end values. Then I could convert lookup() into a function that calls another cdef function called _lookup(), and use that cdef function to do the actual query.
In an ideal world, the whole structure could live in memory and take up less than 2GB of memory for ~2,000,000 entries (each entry with two ints and a string).
Edit:
I figured out how to do this efficiently with sqlite for large file, to see the complete code with sqlite see here: https://github.com/MikeDacre/python_bed_lookup
However, I still think that the class above can be optimized with cython to make the dictionary smaller in memory and lookups faster, any suggestions are appreciated.
Thanks!
To expand on my suggestion in the comments a bit, instead of storing (start,end) as a tuple, store it as a simple Cython-defined class:
cdef class StartEnd:
    cdef public int start, end

    def __init__(self, start, end):
        self.start = start
        self.end = end
(you could also play with changing the integer type for more size savings). I'm not recommending getting rid of the Python dictionaries because they're easy to use, and (I believe) optimised to be reasonably efficient for the (common in Python) case of string keys.
We can estimate the rough size savings by using sys.getsizeof. (Be aware that this will work well for built-in classes and Cython classes, but not so well for Python classes so don't trust it too far. Also be aware that the results are platform dependent so yours may differ slightly).
>>> sys.getsizeof((1,2)) # tuple
64
>>> sys.getsizeof(1) # Python int
28
(therefore each tuple contains 64+28+28=120 bytes)
>>> sys.getsizeof(StartEnd(1,2)) # my custom class
24
(24 makes sense: it's the PyObject_Head (16 bytes: a 64bit integer and a pointer) + 2 32-bit integers).
Therefore, 5 times smaller, which is a good start I think.
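For comparison, pure Python can claw back much of the same saving with __slots__, which drops the per-instance __dict__. A rough sketch (exact byte counts are platform dependent, so it only checks relative sizes):

```python
import sys

class StartEndPlain:
    def __init__(self, start, end):
        self.start = start
        self.end = end

class StartEndSlots:
    __slots__ = ('start', 'end')   # fixed attribute slots, no per-instance __dict__

    def __init__(self, start, end):
        self.start = start
        self.end = end

plain = StartEndPlain(1, 2)
slim = StartEndSlots(1, 2)

# The slotted instance has no __dict__ to pay for.
assert hasattr(plain, '__dict__')
assert not hasattr(slim, '__dict__')

# getsizeof does not follow references, but the slotted object alone is
# smaller than the plain object plus its attribute dict.
assert sys.getsizeof(slim) < sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
```

It won't match the Cython class (the slot values are still full Python ints, not C ints), but it is a one-line change per class.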
In my limited experience with cython and numpy, it is most profitable to use cython for 'inner' calculations that don't need to use Python/numpy code. They are iterations that can be cast to compact and fast C code.
Here's a rewrite of your code, splitting out two classes that could be recast as Cython/C structures:
# cython candidate, like DavidW's StartEnd
class Gene(object):
    def __init__(self, values):
        self.chr = values[0]
        self.start = int(values[1])
        self.end = int(values[2])
        self.gene = values[3]

    def find(self, i):
        return self.start <= i < self.end

    def __repr__(self):
        return "%s(%s, %d:%d)" % (self.chr, self.gene, self.start, self.end)

# cython candidate
class Chrom(list):
    def add(self, aGene):
        self.append(aGene)

    def find(self, loc):
        # find - analogous to string find?
        i = int(loc)
        for gene in self:
            if gene.find(i):
                return gene  # gene.gene
        return None

    def __repr__(self):
        astr = []
        for gene in self:
            astr += [repr(gene)]
        return '; '.join(astr)
These would be imported and used by a high-level Python function (or class) that does not need to be in the Cython .pyx file:
from collections import defaultdict

def load(anIterable):
    data = defaultdict(Chrom)
    for line in anIterable:
        f = line.rstrip().split(',')
        if len(f) < 4:
            continue
        aGene = Gene(f)
        data[aGene.chr].add(aGene)
    return data
Use with a file or a text simulation:
# bedfile = 'filename'
# with open(bedfile) as infile:
#     data = load(infile)
txt = """\
A, 1,4,a
A, 4,8,b
B, 3,5,a
B, 5,10,c
"""
data = load(txt.splitlines())
print data
# defaultdict(<class '__main__.Chrom'>, {
#   'A': A(a, 1:4); A(b, 4:8),
#   'B': B(a, 3:5); B(c, 5:10)})
print 3, data['A'].find(3)    # a gene
print 9, data['B'].find(9)    # c gene
print 11, data['B'].find(11)  # none
I could define a find function that defers to a method if available, otherwise uses its own. This is analogous to numpy functions that delegate to methods:
def find(chrdata, loc):
    # find - analogous to string find?
    fn = getattr(chrdata, 'find', None)
    if fn is None:
        # raise AttributeError(chrdata, 'does not have find method')
        def fn(loc):
            i = int(loc)
            for gene in chrdata:
                if gene.find(i):
                    return gene  # gene.gene
            return None
    return fn(loc)

print 3, find(data['A'], 3)
Test the find with an ordinary list data structure:
def loadlist(anIterable):
    # collect data in an ordinary list
    data = defaultdict(list)
    for line in anIterable:
        f = line.rstrip().split(',')
        if len(f) < 4:
            continue
        aGene = Gene(f)
        data[aGene.chr].append(aGene)
    return data

data = loadlist(txt.splitlines())
print 3, find(data['A'], 3)
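A further speed-up neither answer spells out: if the intervals on a chromosome don't overlap, the linear scan in lookup/find can become a binary search over intervals sorted by start. A pure-Python sketch (class and method names here are illustrative, not from the original code):

```python
import bisect

class IntervalIndex:
    """Maps half-open, non-overlapping (start, end) intervals to names,
    answering point queries by binary search instead of a linear scan."""

    def __init__(self, intervals):
        # intervals: iterable of (start, end, name) tuples; sort once by start.
        self._items = sorted(intervals)
        self._starts = [s for s, _, _ in self._items]

    def lookup(self, pos):
        # Find the rightmost interval whose start is <= pos, then check
        # whether pos actually falls inside it.
        i = bisect.bisect_right(self._starts, pos) - 1
        if i >= 0:
            start, end, name = self._items[i]
            if start <= pos < end:
                return name
        return None

idx = IntervalIndex([(1, 4, 'a'), (4, 8, 'b'), (20, 30, 'c')])
assert idx.lookup(3) == 'a'
assert idx.lookup(4) == 'b'
assert idx.lookup(10) is None
assert idx.lookup(25) == 'c'
```

Each query is O(log n) per chromosome rather than O(n), which matters once a chromosome holds millions of entries.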

Python OOP __Add__ matrices together (Looping Problem)

class Matrix:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return repr(self.data)

    def __add__(self, other):
        data = []
        for j in range(len(self.data)):
            for k in range(len(self.data[0])):
                data.append([self.data[k] + other.data[k]])
            data.append([self.data[j] + other.data[j]])
            data = []
        return Matrix(data)

x = Matrix([[1,2,3],[2,3,4]])
y = Matrix([[10,10,10],[10,10,10]])
print(x + y, x + x + y)
x = Matrix([[1,2,3],[2,3,4]])
y = Matrix([[10,10,10],[10,10,10]])
print(x + y,x + x + y)
I was able to get Matrices to add for 1 row by n columns, but when I tried to improve it for all n by n matrices by adding in a second loop I got this error.
Traceback (most recent call last):
  line 24, in <module>
    print(x + y, x + x + y)
  line 15, in __add__
    data.append([self.data[k] + other.data[k]])
IndexError: list index out of range
How about this:
class Matrix:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return repr(self.data)

    def __add__(self, other):
        data = []
        for j in range(len(self.data)):
            data.append([])
            for k in range(len(self.data[0])):
                data[j].append(self.data[j][k] + other.data[j][k])
        return Matrix(data)
Your code has a few problems. The first is the basic logic of the addition algorithm:
    data.append([self.data[k] + other.data[k]])
This statement is highly suspect: data is a two-dimensional matrix, but here you are accessing it with a single index. self.data[k] is therefore a whole row, and with + you are concatenating rows (probably not what you wanted, correct?). The solution from highBandWidth is probably what you were looking for.
The second problem is more subtle, and is about the statement
self.data = data
This may be a problem because Python uses so-called "reference semantics". Your matrix will use the passed data parameter for its content without copying it: it stores a reference to the very same data list object you passed to the constructor.
Maybe this is intentional, but maybe it's not; that is not clear. Is it OK for you that if you build two matrices from the same data and then change a single element of the first, the content of the second changes too? If that is not the case, then you should copy the elements of data rather than just assigning the data member, for example using
self.data = [row[:] for row in data]
or using copy.deepcopy from the standard copy module.
A third problem is that you are using just two spaces for indenting. This is not smart: when working in Python you should use 4-space indents and never hard tab characters. Note that I said doing this is not smart, not that you are not smart, so please don't take it personally (I made the very same mistake myself when starting with Python). If you really want to be different, do so by writing amazing bug-free software in Python, not by using bad indentation or bad names for functions and variables. Focus on higher-level beauty.
One last problem is that (once you really understand why your code didn't work) you should read about Python list comprehensions, a tool that can greatly simplify your code when used judiciously. Your addition code could, for example, become
return Matrix([[a + b for a, b in zip(my_row, other_row)]
               for my_row, other_row in zip(self.data, other.data)])
To a trained eye this is easier to read than your original code (and it's also faster).
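Dropping that comprehension into the class and exercising it (a quick sanity check, not part of the original answer):

```python
class Matrix:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return repr(self.data)

    def __add__(self, other):
        # Pair up rows of the two matrices, then pair up elements within each row.
        return Matrix([[a + b for a, b in zip(my_row, other_row)]
                       for my_row, other_row in zip(self.data, other.data)])

x = Matrix([[1, 2, 3], [2, 3, 4]])
y = Matrix([[10, 10, 10], [10, 10, 10]])
assert (x + y).data == [[11, 12, 13], [12, 13, 14]]
assert (x + x + y).data == [[12, 14, 16], [14, 16, 18]]
```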

Fetching inherited model objects in django

I have a django application with the following model:
Object A is a simple object extending from Model with a few fields, and let's say, a particular one is a char field called "NAME" and an Integer field called "ORDER". A is abstract, meaning there are no A objects in the database, but instead...
Objects B and C are specializations of A, meaning they inherit from A and they add some other fields.
Now suppose I need all the objects whose field NAME start with the letter "Z", ordered by the ORDER field, but I want all the B and C-specific fields too for those objects. Now I see 2 approaches:
a) Do the queries individually for B and C objects and fetch two lists, merge them, order manually and work with that.
b) Query A objects for names starting with "Z" ordered by "ORDER" and with the result query the B and C objects to bring all the remaining data.
Both approaches sound highly inefficient: in the first I have to order the results myself, and in the second I have to query the database multiple times.
Is there a magical way I'm missing to fetch all B and C objects, ordered in one single method? Or at least a more efficient way to do this than the both mentioned?
Thanks in Advance!
Bruno
If A can be concrete, you can do this all in one query using select_related.
from django.db import connection

q = A.objects.filter(NAME__istartswith='z').order_by('ORDER').select_related('b', 'c')
for obj in q:
    obj = obj.b or obj.c or obj
    print repr(obj), obj.__dict__  # (to prove the subclass-specific attributes exist)
print "query count:", len(connection.queries)
This question was answered here.
Use the InheritanceManager from the django-model-utils project.
Querying using your "b" method will allow you to "bring in" all the remaining data without querying your B and C models separately. You can use the "dot lowercase model name" relation.
http://docs.djangoproject.com/en/dev/topics/db/models/#multi-table-inheritance
for object in A.objects.filter(NAME__istartswith='z').order_by('ORDER'):
    if object.b:
        # do something
        pass
    elif object.c:
        # do something
        pass
You may need to try and except DoesNotExist exceptions. I'm a bit rusty with my django. Good Luck.
So long as you order both queries on B and C, it is fairly easy to merge them without having to do an expensive resort:
# first define a couple of helper functions
def next_or(iterable, other):
    try:
        return iterable.next(), None
    except StopIteration:
        return None, other

def merge(x, y, func=lambda a, b: a <= b):
    ''' merges a pair of sorted iterables '''
    xs = iter(x)
    ys = iter(y)
    a, r = next_or(xs, ys)
    # only draw from ys if xs wasn't already exhausted
    b, r = next_or(ys, xs) if r is None else (None, r)
    while r is None:
        if func(a, b):
            yield a
            a, r = next_or(xs, ys)
        else:
            yield b
            b, r = next_or(ys, xs)
    else:
        # yield whichever value is still pending, then drain the leftovers
        if a is not None:
            yield a
        elif b is not None:
            yield b
        for o in r:
            yield o

# now get your objects & then merge them
b_qs = B.objects.filter(NAME__startswith='Z').order_by('ORDER')
c_qs = C.objects.filter(NAME__startswith='Z').order_by('ORDER')
for obj in merge(b_qs, c_qs, lambda a, b: a.ORDER <= b.ORDER):
    print repr(obj), obj.__dict__
The advantage of this technique is it works with an abstract base class.
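For what it's worth, the standard library already ships this merge: heapq.merge lazily combines any number of sorted iterables (on Python 3.5+ it accepts a key= function). A sketch with plain objects standing in for the model instances:

```python
import heapq
from operator import attrgetter

class Row:
    """Stand-in for a model instance with NAME and ORDER fields."""
    def __init__(self, name, order):
        self.NAME, self.ORDER = name, order

    def __repr__(self):
        return '%s(%d)' % (self.NAME, self.ORDER)

# Two already-sorted streams, like the two ordered querysets.
b_qs = [Row('Zeta', 1), Row('Zoo', 4)]
c_qs = [Row('Zen', 2), Row('Zip', 3), Row('Zulu', 5)]

# heapq.merge consumes both streams lazily and yields in ORDER order.
merged = list(heapq.merge(b_qs, c_qs, key=attrgetter('ORDER')))
assert [r.ORDER for r in merged] == [1, 2, 3, 4, 5]
```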
