In my previous question a lot of users wanted me to give some more data to toy with. So I got working on exporting all my data and processing it with Python, but then I realized: where do I leave all this data?
Well I decided the best thing would be to stick them in a database, so at least I don't have to parse the raw files every time. But since I know nothing about databases this is turning out to be quite confusing. I tried some tutorials to create a sqlite database, add a table and a field, and insert my numpy.arrays, but I can't get it to work.
Typically my results per dog look like this:
So I have 35 different dogs and each dog has 24 measurements. Each measurement has an unknown number of contacts. Each measurement consists of a 3D array (248 frames of the whole plate [255x63]) and a 2D array (the maximal values for each sensor of the plate [255x63]). Storing one value in a database wasn't a problem, but getting my 2D arrays in there didn't seem to work.
So my question is how should I order this in a database and insert my arrays into it?
You'll probably want to start out with a dogs table containing all the flat (non array) data for each dog, things which each dog has one of, like a name, a sex, and an age:
CREATE TABLE `dogs` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
`name` VARCHAR(64),
`age` INT UNSIGNED,
`sex` ENUM('Male','Female')
);
From there, each dog "has many" measurements, so you need a dog_measurements table to store the 24 measurements:
CREATE TABLE `dog_measurements` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
`dog_id` INT UNSIGNED NOT NULL,
`paw` ENUM ('Front Left','Front Right','Rear Left','Rear Right'),
`taken_at` DATETIME NOT NULL
);
Then whenever you take a measurement, you INSERT INTO dog_measurements (dog_id,taken_at) VALUES (*?*, NOW()); where *?* is the dog's ID from the dogs table.
You'll then want tables to store the actual frames for each measurement, something like:
CREATE TABLE `dog_measurement_data` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
`dog_measurement_id` INT UNSIGNED NOT NULL,
`frame` INT UNSIGNED,
`sensor_row` INT UNSIGNED,
`sensor_col` INT UNSIGNED,
`value` FLOAT
);
That way, for each of the 248 frames, you loop over each of the 255x63 sensors and store the value for that sensor, along with the frame number, in the database:
INSERT INTO `dog_measurement_data` (`dog_measurement_id`,`frame`,`sensor_row`,`sensor_col`,`value`) VALUES
(*measurement_id?*, *frame_number?*, *sensor_row?*, *sensor_col?*, *value?*)
Obviously replace measurement_id?, frame_number?, sensor_row?, sensor_col?, value? with real values :-)
So basically, each dog_measurement_data row is a single sensor value for a given frame. That way, to get all the sensor values for a given frame, you would:
SELECT `sensor_row`,`sensor_col`,`value` FROM `dog_measurement_data`
WHERE `dog_measurement_id`=*some measurement id* AND `frame`=*some frame number*
ORDER BY `sensor_row`,`sensor_col`
And this will give you all the rows and cols for that frame.
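To make the round trip concrete, here is a minimal Python sketch of that insert/select pattern, assuming sqlite3 rather than MySQL and that a dog_measurement_data table like the one above already exists; the helper names and the dogs.db filename are illustrative:

import sqlite3
import numpy as np

conn = sqlite3.connect('dogs.db')

def store_measurement_data(conn, measurement_id, data):
    # data is the 3D numpy array for one measurement: frames x rows x cols
    rows = [(measurement_id, frame, r, c, float(data[frame, r, c]))
            for frame in range(data.shape[0])
            for r in range(data.shape[1])
            for c in range(data.shape[2])]
    conn.executemany(
        "INSERT INTO dog_measurement_data "
        "(dog_measurement_id, frame, sensor_row, sensor_col, value) "
        "VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()

def load_frame(conn, measurement_id, frame, shape=(255, 63)):
    # Rebuild one frame as a 2D numpy array from the stored rows.
    cur = conn.execute(
        "SELECT sensor_row, sensor_col, value FROM dog_measurement_data "
        "WHERE dog_measurement_id=? AND frame=? ORDER BY sensor_row, sensor_col",
        (measurement_id, frame))
    out = np.zeros(shape)
    for r, c, v in cur:
        out[r, c] = v
    return out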
Django has a library for encapsulating all the database work into Python classes, so you don't have to mess with raw SQL until you have to do something really clever. Even though Django is a framework for web applications, you can use the database ORM by itself.
Josh's models would look like this in Python using Django:
from django.db import models
class Dog(models.Model):
    # Might want to look at storing birthday instead of age.
    # If you track age, you probably need another field telling
    # you when in the year age goes up by 1... and at that point,
    # you're really storing a birthday.
    name = models.CharField(max_length=64)
    age = models.IntegerField()
    genders = [
        ('M', 'Male'),
        ('F', 'Female'),
    ]
    gender = models.CharField(max_length=1, choices=genders)
class Measurement(models.Model):
    dog = models.ForeignKey(Dog, related_name="measurements", on_delete=models.CASCADE)
    paws = [
        ('FL', 'Front Left'),
        ('FR', 'Front Right'),
        ('RL', 'Rear Left'),
        ('RR', 'Rear Right'),
    ]
    paw = models.CharField(max_length=2, choices=paws)
    taken_at = models.DateTimeField(auto_now_add=True)
class Measurement_Point(models.Model):
    measurement = models.ForeignKey(Measurement, related_name="data_points", on_delete=models.CASCADE)
    frame = models.IntegerField()
    sensor_row = models.PositiveIntegerField()
    sensor_col = models.PositiveIntegerField()
    value = models.FloatField()

    class Meta:
        ordering = ['frame', 'sensor_row', 'sensor_col']
The id fields are created automatically.
Then you can do things like:
dog = Dog()
dog.name = "Pochi"
dog.age = 3
dog.gender = 'M'
# dog.gender will return 'M', and dog.get_gender_display() will return 'Male'
dog.save()
# Or, written another way:
dog = Dog.objects.create(name="Fido", age=3, gender='M')
To take a measurement:
measurement = dog.measurements.create(paw='FL')
for frame in range(248):
    for row in range(255):
        for col in range(63):
            measurement.data_points.create(frame=frame, sensor_row=row,
                                           sensor_col=col, value=myData[frame][row][col])
Finally, to get a frame:
# For the sake of argument, assuming the dogs have unique names.
# If not, you'll need some more fields in the Dog model to disambiguate.
dog = Dog.objects.get(name="Pochi", gender='M')
# For example, grab the latest measurement...
measurement = dog.measurements.all().order_by('-taken_at')[0]
# `theFrameNumber` has to be set somewhere...
theFrame = measurement.data_points.filter(frame=theFrameNumber).values_list('value')
Note: this will return a list of tuples (e.g. [(1.5,), (1.8,), ... ]), since values_list() can retrieve multiple fields at once. NumPy's reshape then plays the same role as Matlab's reshape for remapping a flat vector back into a matrix.
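For example, a small sketch of rebuilding the 255x63 frame with NumPy, using values_list(..., flat=True) to skip the 1-tuples and assuming the Meta ordering above (theFrameNumber is assumed to be set elsewhere):

import numpy as np

values = measurement.data_points.filter(frame=theFrameNumber) \
                                .values_list('value', flat=True)
frame_array = np.array(list(values)).reshape(255, 63)  # rows x cols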
I think the part you're stuck on is how to put 2D data into a database.
If you think of the relation between two columns, you can treat it as 2D data: the first column is the X-axis data and the second column is the Y-axis data. The same idea extends to 3D data.
Finally your db should look like this:
Table: Dogs
Columns: DogId, DogName -- contains data for each dog
Table: Measurements
Columns: DogId, MeasurementId, 3D_DataId, 2D_DataId -- contains measurements of each dog
Table: 3D_data
Columns: 3D_DataId, 3D_X, 3D_Y, 3D_Z -- contains all 3D data of a measurement
Table: 2D_data
Columns: 2D_DataId, 2D_X, 2D_Y -- contains all 2D data of a measurement
Also, you may want to store your 3D data and 2D data in a particular order. In that case, you will have to add a column that stores that order to the 3D data and 2D data tables.
The only thing I would add to Josh's answer is that if you don't need to query individual frames or sensors, just store the arrays as BLOBs in the dog_measurement_data table. I have done this before with large binary sets of sensor data and it worked out well. You basically query the 2D and 3D arrays with each measurement and manipulate them in code instead of in the database.
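Here is a minimal sketch of that BLOB idea, assuming SQLite and numpy's own save/load for serialization so the shape and dtype survive the round trip; the simplified table layout and names below are illustrative, not the schema from the answer above:

import io
import sqlite3
import numpy as np

def array_to_blob(arr):
    buf = io.BytesIO()
    np.save(buf, arr)            # keeps shape and dtype
    return buf.getvalue()

def blob_to_array(blob):
    return np.load(io.BytesIO(blob))

conn = sqlite3.connect('dogs.db')
conn.execute("""CREATE TABLE IF NOT EXISTS dog_measurement_data (
                    dog_measurement_id INTEGER,
                    frames BLOB,        -- the 248x255x63 array
                    max_values BLOB     -- the 255x63 array
                )""")

frames = np.random.rand(248, 255, 63)        # stand-in for real data
max_values = frames.max(axis=0)
conn.execute("INSERT INTO dog_measurement_data VALUES (?, ?, ?)",
             (1, array_to_blob(frames), array_to_blob(max_values)))
conn.commit()

blob, = conn.execute("SELECT max_values FROM dog_measurement_data "
                     "WHERE dog_measurement_id=1").fetchone()
restored = blob_to_array(blob)               # back to a 255x63 numpy array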
I've benefited a lot from the sqlalchemy package; it is an Object Relational Mapper. What this means is that you can create a very clear and distinct separation between your objects and your data:
SQL databases behave less like object collections the more size and performance start to matter; object collections behave less like tables and rows the more abstraction starts to matter. SQLAlchemy aims to accommodate both of these principles.
You can create objects representing your different nouns (Dog, Measurement, Plate, etc). Then you create a table via sqlalchemy constructs which will contain all the data that you want to associate with, say, a Dog object. Finally, you create a mapper between the Dog object and the dog_table.
This is difficult to understand without an example, and I will not reproduce one here. Instead, please start by reading this case study and then study this tutorial.
Once you can think of your Dogs and Measurements as you do in the real world (that is, the objects themselves) you can start factoring out the data that makes them up.
Finally, try not to marry your data to a specific format (as you currently do by using numpy arrays). Instead, think in terms of the simple numbers and transform them on demand into the specific format your current application demands (along the lines of a Model-View-Controller paradigm).
Good luck!
From your description, I would highly recommend looking into PyTables. It's not a relational database in the traditional sense, but it has most of the features you're likely to be using (e.g. querying), while allowing easy storage of large, multidimensional datasets and their attributes. As an added bonus, it's tightly integrated with numpy.
Related
I have a data model where I store a list of values separated by commas (1,2,3,4,5...).
In my code, in order to work with arrays instead of strings, I have defined the model like this:
class MyModel(db.Model):
    pk = db.Column(db.Integer, primary_key=True)
    __fake_array = db.Column(db.String(500), name="fake_array")

    @property
    def fake_array(self):
        if not self.__fake_array:
            return
        return self.__fake_array.split(',')

    @fake_array.setter
    def fake_array(self, value):
        if value:
            self.__fake_array = ",".join(value)
        else:
            self.__fake_array = None
This works perfectly, and from the point of view of my source code "fake_array" is an array; it's only transformed into a string when it's stored in the database.
The problem appears when I try to filter by that field. Expressions like this don't work:
MyModel.query.filter_by(fake_array="1").all()
It seems that I can't filter using the SQLAlchemy query model.
What can I do here? Is there any way to filter this kind of field? Is there a better pattern for the "fake_array" problem?
Thanks!
What you're trying to do should really be replaced with a pair of tables and a relationship between them.
The first table (which I'll call A) contains everything BUT the array column, and it should have a primary key of some sort. You should have another table (which I'll call B) that contains a primary key, a foreign key column to A (which I'll call a_id), and an integer value column.
Using this layout, each row in the A table has its associated array in table B where B's a_id == A.id via a join. You can add or remove values from the array by manipulating the rows in table B. You can filter by using a join.
If the order of the values is needed, then create an order column in table B.
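A hedged sketch of that two-table layout in the Flask-SQLAlchemy style used in the question; MyModelItem, position and value are illustrative names:

class MyModel(db.Model):
    __tablename__ = 'my_model'
    pk = db.Column(db.Integer, primary_key=True)
    # replaces the old fake_array column
    items = db.relationship('MyModelItem', backref='parent',
                            order_by='MyModelItem.position')

class MyModelItem(db.Model):
    __tablename__ = 'my_model_item'
    pk = db.Column(db.Integer, primary_key=True)
    my_model_pk = db.Column(db.Integer, db.ForeignKey('my_model.pk'),
                            nullable=False)
    position = db.Column(db.Integer)   # keeps the original array order
    value = db.Column(db.Integer)

# "rows whose array contains 1" becomes a join instead of a string match:
MyModel.query.join(MyModelItem).filter(MyModelItem.value == 1).all()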
I'm about to start some Python Data analysis unlike anything I've done before. I'm currently studying numpy, but so far it doesn't give me insight on how to do this.
I'm using Python 2.7.14 (Anaconda) with cx_Oracle to query complex records.
Each record will be a unique individual with a column for Employee ID, Relationship Tuples (a Relationship Type Code paired with a Department number; may contain multiple), and Account Flags (flag strings; may contain multiple): 3 columns total.
so one record might be:
[(123456), (135:2345678, 212:4354670, 198:9876545), (Flag1, Flag2, Flag3)]
I need to develop a python script that will take these records and create various counts.
The example record would be counted in at least 9 different counts:
How many with relationship: 135
How many with relationship: 212
How many with relationship: 198
How many in Department: 2345678
How many in Department: 4354670
How many in Department: 9876545
How many with Flag: Flag1
How many with Flag: Flag2
How many with Flag: Flag3
The other tricky part of this is that I can't pre-define the relationship codes, departments, or flags. What I'm counting has to be determined by the data retrieved from the query.
Once I understand how to do that, hopefully the next step, getting how many with relationship X also have flag Y, and so on, will be intuitive.
I know this is a lot to ask about, but if someone could just point me in the right direction so I can research or try some tutorials, that would be very helpful. Thank you!
To do a good analysis you first need to structure this data; you can do it in your database engine or in Python (I will do it this way, using pandas as SNygard suggested).
To start, I create some sample data (based on what you provided):
import pandas as pd
import numpy as np
from ast import literal_eval
data = [[12346, '(135:2345678, 212:4354670, 198:9876545)', '(Flag1, Flag2, Flag3)'],
[12345, '(136:2343678, 212:4354670, 198:9876541, 199:9876535)', '(Flag1, Flag4)']]
df = pd.DataFrame(data,columns=['id','relationships','flags'])
df = df.set_index('id')
df
This returns a DataFrame like this:
[image: raw_pandas_dataframe]
In order to summarize or count by columns, we need to improve our data structure so that we can apply group-by operations on department, relationship or flags.
We will convert our relationships and flags columns from string type to a python list of strings. So, the flags column will be a python list of flags, and the relationships column will be a python list of relations.
df['relationships'] = df['relationships'].str.replace(r'\(', '', regex=True).str.replace(r'\)', '', regex=True)
df['relationships'] = df['relationships'].str.split(',')
df['flags'] = df['flags'].str.replace(r'\(', '', regex=True).str.replace(r'\)', '', regex=True)
df['flags'] = df['flags'].str.split(',')
df
The result is:
[image: dataframe_1]
With our relationships column converted to lists, we can create a new DataFrame with as many columns as the longest relationship list.
rel = pd.DataFrame(df['relationships'].values.tolist(), index=df.index)
After that we need to stack our columns while preserving the index, so we will use a pandas MultiIndex: the id and the relation column number (0, 1, 2, 3).
relations = rel.stack()
relations.index.names = ['id','relation_number']
relations
We get: [image: dataframe_2]
At this moment we have all of our relations in rows, but we still can't group by relation_type. So we will split our relations data into two columns, relation_type and department, using the : separator.
clear_relations = relations.str.split(':')
clear_relations = pd.DataFrame(clear_relations.values.tolist(), index=clear_relations.index,columns=['relation_type','department'])
clear_relations
The result is
[image: dataframe_3_clear_relations]
Our relations are ready to analyze, but our flags structure still isn't very useful. So we will convert the flag lists to columns and then stack them as well.
flags = pd.DataFrame(df['flags'].values.tolist(), index=rel.index)
flags = flags.stack()
flags.index.names = ['id','flag_number']
The result is: [image: dataframe_4_clear_flags]
Voilà! It's all ready to analyze.
So, for example, to see how many relations of each type we have, and which one is the biggest:
clear_relations.groupby('relation_type').agg('count')['department'].sort_values(ascending=False)
We get: [image: group_by_relation_type]
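Following the same pattern, a couple of extra counts you could pull from the structures built above (note the flags keep a leading space from the comma split, hence the strip):

# How many relations point at each department
clear_relations.groupby('department').agg('count')['relation_type'].sort_values(ascending=False)

# How many ids carry each flag
flags.str.strip().value_counts()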
All code: Github project
If you're willing to consider other packages, take a look at pandas which is built on top of numpy. You can read sql statements directly into a dataframe, then filter.
For example,
import pandas
sql = '''SELECT * FROM <table> WHERE <condition>'''
df = pandas.read_sql(sql, <connection>)
# Your output might look like the following:
0 1 2
0 12346 (135:2345678, 212:4354670, 198:9876545) (Flag1, Flag2, Flag3)
1 12345 (136:2343678, 212:4354670, 198:9876545) (Flag1, Flag2, Flag4)
# Format your records into rows
# This part will take some work, and really depends on how your data is formatted
# Do you have repeated values? Are the records always the same size?
# Select only the rows where relationship = 125
rel_125 = df[df['Relationship'] == 125]
The pandas formatting is more in depth than fits in a Q&A, but some good resources are here: 10 Minutes to Pandas.
You can also filter the rows directly, though it may not be the most efficient. For example, the following query selects only the rows where a relationship starts with '212'.
df[df['Relationship'].apply(lambda x: any(y.startswith('212') for y in x))]
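Once each cell holds a Python list, a hedged sketch of the counting itself; explode() needs pandas 0.25+, and the 'Relationship'/'Flags' column names are assumptions about how you name things after formatting:

# How many records have each relationship type code (the part before ':')
df['Relationship'].explode().str.split(':').str[0].str.strip().value_counts()

# How many records fall in each department (the part after ':')
df['Relationship'].explode().str.split(':').str[1].str.strip().value_counts()

# How many records carry each flag
df['Flags'].explode().str.strip().value_counts()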
We have been using Cassandra for a while now and we are trying to get a really optimized table going that will be able to quickly query and filter on about 100k rows.
Our model looks something like this:
class FailedCDR(Model):
    uuid = columns.UUID(partition_key=True, primary_key=True)
    num_attempts = columns.Integer(index=True)
    datetime = columns.Integer()
If I describe the table, it clearly shows that num_attempts is indexed:
CREATE TABLE cdrs.failed_cdrs (
uuid uuid PRIMARY KEY,
datetime int,
num_attempts int
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX index_failed_cdrs_num_attempts ON cdrs.failed_cdrs (num_attempts);
We want to be able to run a filter similar to this:
failed = FailedCDR.filter(num_attempts__lte=9)
But this happens:
QueryException: Where clauses require either a "=" or "IN" comparison with either a primary key or indexed field
How can we accomplish a similar task?
If you want to do a range query in CQL, you need the field to be a clustering column.
So you'll want the num_attempts field to be a clustering column.
Also if you want to do a single query, you need all the rows you want to query in the same partition (or a small number of partitions that you can access using an IN clause). Since you only have 100K rows, that is small enough to fit in one partition.
So you could define your table like this:
CREATE TABLE test.failed_cdrs (
partition int,
num_attempts int,
uuid uuid,
datetime int,
PRIMARY KEY (partition, num_attempts, uuid));
You would insert your data with a constant for the partition key, such as 1.
INSERT INTO failed_cdrs (uuid, datetime, num_attempts, partition)
VALUES ( now(), 123, 5, 1);
Then you can do range queries like this:
SELECT * from failed_cdrs where partition=1 and num_attempts >=8;
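If you stay on cqlengine, the model matching that table might look roughly like this; the clustering_order and the constant bucket value are my assumptions rather than anything from your schema:

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class FailedCDR(Model):
    __table_name__ = 'failed_cdrs'
    partition = columns.Integer(partition_key=True, default=1)  # single bucket
    num_attempts = columns.Integer(primary_key=True, clustering_order="ASC")
    uuid = columns.UUID(primary_key=True)
    datetime = columns.Integer()

# The range filter now works because num_attempts is a clustering column:
failed = FailedCDR.filter(partition=1, num_attempts__lte=9)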
The drawback to this method is that to change the value of num_attempts, you need to delete the old row and insert a new row since you are not allowed to update key fields. You could do the delete and insert for that in a batch statement.
A better option that will become available in Cassandra 3.0 is to make a materialized view that has num_attempts as a clustering column, in which case Cassandra would take care of the delete and insert for you when you updated num_attempts in the base table. The 3.0 release is currently in beta testing.
Maybe this question will be made clearer through an example. Let's say the dataset I'm working with is a whole bunch (several gigabytes) of variable-length lists of tuples, each associated with a unique ID and a bit of metadata, and I want to be able to quickly retrieve any of these lists by its ID.
I currently have two tables set up more or less like this:
TABLE list(
id VARCHAR PRIMARY KEY,
flavor VARCHAR,
type VARCHAR,
list_element_start INT,
list_element_end INT)
TABLE list_element(
id INT PRIMARY KEY,
value1 FLOAT,
value2 FLOAT)
To pull a specific list out of the database I currently do something like this:
SELECT list_element_start, list_element_end FROM list WHERE id = 'my_list_id'
Then I use the retrieved list_element_start and list_element_end values to get the list elements:
SELECT *
FROM list_element
WHERE id BETWEEN(my_list_element_start, my_list_element_end)
Of course, this works very fast, but I feel as though there's a better way to do this. I'm aware that I could have another column in list_element called list_id, and then do something like SELECT * FROM list_element WHERE list_id = 'my_list_id' ORDER BY id. However, it seems to me that having that extra column, as well as a foreign key index on that column, would take up a lot of unnecessary space.
Is there a simpler way to do this?
Apologies if this question has been asked before, but I was unable to locate the answer. I'd also like to use SQLAlchemy in Python to do all of this, if possible.
Thanks in advance!
between is not a function, so I don't know what you think is going on there. Anyway... why not:
SELECT e.*
FROM list_element e
JOIN list l
ON e.id BETWEEN l.list_element_start AND l.list_element_end
WHERE l.id = 'my_list_id'
Or am I missing something?
You can normalize each element of your array into a row. The following is the declarative style in SQLAlchemy that will give you a "MyList" object with flavor etc, and then elements will be an actual Python list of each "MyElement" object. You could get more complicated to weed out the extra id and idx within the returned element list, but this should be plenty fast enough.
Also, above, you had mixed varchar and int for your primary keys; not sure if it was just an oversight, but you ought not do that. Additionally, when handling large data sets, remember options like chunking: you can use offset and limit to work with smaller sizes and process iteratively (see the usage sketch after the models below).
from sqlalchemy import Column, Integer, String, ForeignKey, PrimaryKeyConstraint
from sqlalchemy.orm import relationship
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class MyList(Base):
    __tablename__ = 'my_list'
    id = Column(Integer, primary_key=True)
    flavor = Column(String)
    list_type = Column(String)
    elements = relationship('MyElement', order_by='MyElement.idx')

class MyElement(Base):
    __tablename__ = 'my_element'
    id = Column(Integer, ForeignKey('my_list.id'))
    idx = Column(Integer)
    val = Column(Integer)
    __table_args__ = (PrimaryKeyConstraint('id', 'idx'),)
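A possible usage sketch under these models; the session setup and the process() helper are assumed to exist elsewhere:

# Fetch one list and its ordered elements
my_list = session.query(MyList).filter_by(id=42).one()
values = [element.val for element in my_list.elements]  # already ordered by idx

# Chunked processing with limit/offset, as suggested above
chunk_size = 10000
offset = 0
while True:
    chunk = (session.query(MyElement)
                    .order_by(MyElement.id, MyElement.idx)
                    .limit(chunk_size)
                    .offset(offset)
                    .all())
    if not chunk:
        break
    process(chunk)      # placeholder for your own per-chunk work
    offset += chunk_size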
Say I have the following variables and their corresponding values, which represent a record.
name = 'abc'
age = 23
weight = 60
height = 174
Please note that the value could be of different types (string, integer, float, reference-to-any-other-object, etc).
There will be many records (at least >100,000). Each record will be unique when all four of these variables (actually their values) are put together. In other words, no record has all 4 values the same as another record.
I am trying to find an efficient data structure in Python which will allow me to (store and) retrieve records based on any one of these variables in log(n) time complexity.
For example:
def retrieve(name=None, age=None, weight=None, height=None):
    if name is not None and age is None and weight is None and height is None:
        pass  # get all records with the given name
    if name is None and age is not None and weight is None and height is None:
        pass  # get all records with the given age
    ....
    return records
The way the retrieve should be called is as follows:
retrieve(name='abc')
The above should return [{name:'abc', age:23, weight:50, height:175}, {name:'abc', age:28, weight:55, height:170}, etc]
retrieve(age=23)
The above should return [{name:'abc', age:23, weight:50, height:175}, {name:'def', age:23, weight:65, height:180}, etc]
And, I may need to add one or two more variables to this record in future. For example, say, sex = 'm'. So, the retrieve function must be scalable.
So in short: Is there a data structure in Python which will allow storing a record with n number of columns (name, age, sex, weigh, height, etc) and retrieving records based on any (one) of the column in logarithmic (or ideally constant - O(1) look-up time) complexity?
There isn't a single data structure built into Python that does everything you want, but it's fairly easy to use a combination of the ones it does have to achieve your goals and do so fairly efficiently.
For example, say your input was the following data in a comma-separated-value file called employees.csv with field names defined as shown by the first line:
name,age,weight,height
Bob Barker,25,175,6ft 2in
Ted Kingston,28,163,5ft 10in
Mary Manson,27,140,5ft 6in
Sue Sommers,27,132,5ft 8in
Alice Toklas,24,124,5ft 6in
The following is working code which illustrates how to read and store this data into a list of records, and automatically create separate look-up tables for finding the records associated with the values contained in the fields of each of those records.
The records are instances of a class created by namedtuple, which is very memory efficient because each one lacks the __dict__ attribute that class instances normally contain. Using them makes it possible to access the fields of each record by name using dot syntax, like record.fieldname.
The look-up tables are defaultdict(list) instances, which provide dictionary-like O(1) look-up times on average, and also allow multiple values to be associated with each key. So the look-up key is the field value being sought, and the data associated with it is a list of the integer indices of the Person records stored in the employees list with that value, so the tables will all be relatively small.
Note that the code for the class is completely data-driven in that it doesn't contain any hardcoded field names; they are instead all taken from the first row of the csv data input file when it's read in. Of course, when using an instance, all retrieve() method calls must provide valid field names.
Update
Modified to not create a lookup table for every unique value of every field when the data file is first read. Now the retrieve() method "lazily" creates them only when one is needed (and saves/caches the result for future use). Also modified to work in Python 2.7+ including 3.x.
from collections import defaultdict, namedtuple
import csv


class DataBase(object):
    def __init__(self, csv_filename, recordname):
        # Read data from csv format file into a list of namedtuples.
        with open(csv_filename, 'r') as inputfile:
            csv_reader = csv.reader(inputfile, delimiter=',')
            self.fields = next(csv_reader)  # Read header row.
            self.Record = namedtuple(recordname, self.fields)
            self.records = [self.Record(*row) for row in csv_reader]
            self.valid_fieldnames = set(self.fields)
        # Create an empty table of lookup tables for each field name that maps
        # each unique field value to a list of record-list indices of the ones
        # that contain it.
        self.lookup_tables = {}

    def retrieve(self, **kwargs):
        """ Fetch a list of records with a field name with the value supplied
            as a keyword arg (or return None if there aren't any).
        """
        if len(kwargs) != 1:
            raise ValueError(
                'Exactly one fieldname keyword argument required for retrieve function '
                '(%s specified)' % ', '.join([repr(k) for k in kwargs.keys()]))
        field, value = kwargs.popitem()  # Keyword arg's name and value.
        if field not in self.valid_fieldnames:
            raise ValueError('keyword arg "%s" isn\'t a valid field name' % field)
        if field not in self.lookup_tables:  # Need to create a lookup table?
            lookup_table = self.lookup_tables[field] = defaultdict(list)
            for index, record in enumerate(self.records):
                field_value = getattr(record, field)
                lookup_table[field_value].append(index)
        # Return (possibly empty) sequence of matching records.
        return tuple(self.records[index]
                     for index in self.lookup_tables[field].get(value, []))


if __name__ == '__main__':
    empdb = DataBase('employees.csv', 'Person')

    print("retrieve(name='Ted Kingston'): {}".format(empdb.retrieve(name='Ted Kingston')))
    print("retrieve(age='27'): {}".format(empdb.retrieve(age='27')))
    print("retrieve(weight='150'): {}".format(empdb.retrieve(weight='150')))

    try:
        print("retrieve(hight='5ft 6in'):".format(empdb.retrieve(hight='5ft 6in')))
    except ValueError as e:
        print("ValueError('{}') raised as expected".format(e))
    else:
        raise type('NoExceptionError', (Exception,), {})(
            'No exception raised from "retrieve(hight=\'5ft\')" call.')
Output:
retrieve(name='Ted Kingston'): [Person(name='Ted Kingston', age='28', weight='163', height='5ft 10in')]
retrieve(age='27'): [Person(name='Mary Manson', age='27', weight='140', height='5ft 6in'),
Person(name='Sue Sommers', age='27', weight='132', height='5ft 8in')]
retrieve(weight='150'): None
retrieve(hight='5ft 6in'): ValueError('keyword arg "hight" is an invalid fieldname')
raised as expected
Is there a data structure in Python which will allow storing a record with n number of columns (name, age, sex, weigh, height, etc) and retrieving records based on any (one) of the column in logarithmic (or ideally constant - O(1) look-up time) complexity?
No, there is none. But you could try to implement one on the basis of one dictionary per value dimension. As long as your values are hashable of course. If you implement a custom class for your records, each dictionary will contain references to the same objects. This will save you some memory.
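A minimal sketch of that idea, using the field names from the question; the Record class and the helper names are illustrative:

from collections import defaultdict

class Record(object):
    # __slots__ keeps the per-record memory footprint small
    __slots__ = ('name', 'age', 'weight', 'height')

    def __init__(self, name, age, weight, height):
        self.name, self.age, self.weight, self.height = name, age, weight, height

# One dictionary per dimension; they all reference the same Record objects.
indexes = {field: defaultdict(list) for field in Record.__slots__}

def add(record):
    for field in Record.__slots__:
        indexes[field][getattr(record, field)].append(record)

def retrieve(**kwargs):
    (field, value), = kwargs.items()   # exactly one keyword argument expected
    return indexes[field].get(value, [])

add(Record('abc', 23, 60, 174))
add(Record('abc', 28, 55, 170))
matches = retrieve(age=23)             # average O(1) dictionary lookup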
You could achieve logarithmic time complexity in a relational database using indexes (O(log(n)**k) with single column indexes). Then to retrieve data just construct appropriate SQL:
names = {'name', 'age', 'weight', 'height'}

def retrieve(c, **params):
    if not (params and names.issuperset(params)):
        raise ValueError(params)
    where = ' and '.join(map('{0}=:{0}'.format, params))
    return c.execute('select * from records where ' + where, params)
Example:
import sqlite3

c = sqlite3.connect(':memory:')
c.row_factory = sqlite3.Row  # to provide key access

# create table
c.execute("""create table records
             (name text, age integer, weight real, height real)""")
# insert data
records = (('abc', 23, 60, 174+i) for i in range(2))
c.executemany('insert into records VALUES (?,?,?,?)', records)
# create indexes
for name in names:
    c.execute("create index idx_{0} on records ({0})".format(name))

try:
    retrieve(c, naame='abc')
except ValueError:
    pass
else:
    assert 0

for record in retrieve(c, name='abc', weight=60):
    print(record['height'])
Output:
174.0
175.0
Given http://wiki.python.org/moin/TimeComplexity how about this:
Have a dictionary for every column you're interested in - AGE, NAME, etc.
Have the keys of those dictionaries (AGE, NAME) be the possible values for the given column (35 or "m").
Have a list of lists representing values for one "collection", e.g. VALUES = [ [35, "m"], ...]
Have the value of column dictionaries (AGE, NAME) be lists of indices from the VALUES list.
Have a dictionary which maps column name to index within lists in VALUES so that you know the first column is age and the second is sex (you could avoid that and use dictionaries, but they introduce a large memory footprint, and with over 100K objects this may or may not be a problem).
Then the retrieve function could look like this:
def retrieve(column_name, column_value):
    if column_name == "age":
        return [VALUES[index] for index in AGE[column_value]]
    elif column_name == "sex":
        return [VALUES[index] for index in SEX[column_value]]
    # repeat for other "columns"
Then, this is what you get
VALUES = [[35, "m"], [20, "f"]]
AGE = {35:[0], 20:[1]}
SEX = {"m":[0], "f":[1]}
KEYS = ["age", "sex"]
retrieve("age", 35)
# [[35, 'm']]
If you want a dictionary, you can do the following:
[dict(zip(KEYS, values)) for values in retrieve("age", 35)]
# [{'age': 35, 'sex': 'm'}]
but again, dictionaries are a little heavy on the memory side, so if you can go with lists of values it might be better.
Both dictionary and list retrieval are O(1) on average - worst case for dictionary is O(n) - so this should be pretty fast. Maintaining that will be a little bit of pain, but not so much. To "write", you'd just have to append to the VALUES list and then append the index in VALUES to each of the dictionaries.
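For instance, the "write" path described above could look like this (store is an illustrative helper, reusing the VALUES/AGE/SEX structures from the example):

def store(age, sex):
    VALUES.append([age, sex])
    index = len(VALUES) - 1
    AGE.setdefault(age, []).append(index)
    SEX.setdefault(sex, []).append(index)

store(35, "f")
# VALUES -> [[35, 'm'], [20, 'f'], [35, 'f']]
# AGE    -> {35: [0, 2], 20: [1]}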
Of course, then best would be to benchmark your actual implementation and look for potential improvements, but hopefully this make sense and will get you going :)
EDIT:
Please note that, as @moooeeeep said, this will only work if your values are hashable and can therefore be used as dictionary keys.