I have many (~2000) locations with time series data. Each time series has millions of rows. I would like to store these in a Postgres database.
My current approach is to have a table for each location time series, and a meta table which stores information about each location (coordinates, elevation etc.). I am using Python/SQLAlchemy to create and populate the tables. I would like to have a relationship between the meta table and each time series table so I can do queries like "select all locations that have data between date A and date B" and "select all data for date A and export a csv with coordinates".
What is the best way to create many tables with the same structure (only the name is different) and have a relationship with a meta table? Or should I use a different database design?
Currently I am using this type of approach to generate a lot of similar mappings:
from sqlalchemy import create_engine, MetaData
from sqlalchemy.types import Float, String, DateTime, Integer
from sqlalchemy import Column, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship, backref

Base = declarative_base()


def make_timeseries(name):
    class TimeSeries(Base):

        __tablename__ = name
        table_name = Column(String(50), ForeignKey('locations.table_name'))
        datetime = Column(DateTime, primary_key=True)
        value = Column(Float)

        location = relationship('Location',
                                backref=backref('timeseries', lazy='dynamic'))

        def __init__(self, table_name, datetime, value):
            self.table_name = table_name
            self.datetime = datetime
            self.value = value

        def __repr__(self):
            return "{}: {}".format(self.datetime, self.value)

    return TimeSeries


class Location(Base):

    __tablename__ = 'locations'
    id = Column(Integer, primary_key=True)
    table_name = Column(String(50), unique=True)
    lon = Column(Float)
    lat = Column(Float)


if __name__ == '__main__':
    connection_string = 'postgresql://user:pw@localhost/location_test'
    engine = create_engine(connection_string)
    metadata = MetaData(bind=engine)
    Session = sessionmaker(bind=engine)
    session = Session()

    TS1 = make_timeseries('ts1')
    # TS2 = make_timeseries('ts2')   # this breaks because of the foreign key
    Base.metadata.create_all(engine)

    session.add(TS1("ts1", "2001-01-01", 999))
    session.add(TS1("ts1", "2001-01-02", -555))

    qs = session.query(Location).first()
    print(qs.timeseries.all())
This approach has some problems, most notably that if I create more than one TimeSeries the foreign key doesn't work. Previously I've used some workarounds, but it all seems like a big hack and I feel there must be a better way of doing this. How should I organise and access my data?
Alternative-1: Table Partitioning
Partitioning immediately comes to mind as soon as I read "exactly the same table structure". I am not a DBA and do not have much production experience with it (even less so on PostgreSQL), but
please read the PostgreSQL - Partitioning documentation. Table partitioning seeks to solve exactly the problem you have, but over 1K tables/partitions sounds challenging; therefore please do more research on forums/SO for scalability-related questions on this topic.
Given that the datetime component is central to both of your most common search criteria, there must be a solid indexing strategy for it. If you decide to go down the partitioning route, the obvious strategy would be to partition by date range. This might allow you to keep older data in different chunks from the most recent data: assuming old data is (almost) never updated, those physical layouts would stay dense and efficient, while you could employ another strategy for more "recent" data.
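To make this concrete, below is a rough sketch of what a single partitioned table could look like, using PostgreSQL declarative partitioning (PostgreSQL 11+ if you want foreign keys on the partitioned table) with the DDL issued through SQLAlchemy. The measurements table and its columns are illustrative assumptions, not taken from your code; the locations table matches your meta table.

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:pw@localhost/location_test')

ddl = [
    # one parent table for all locations, partitioned by time
    """
    CREATE TABLE measurements (
        location_id INTEGER   NOT NULL REFERENCES locations (id),
        datetime    TIMESTAMP NOT NULL,
        value       DOUBLE PRECISION,
        PRIMARY KEY (location_id, datetime)
    ) PARTITION BY RANGE (datetime)
    """,
    # one partition per year; old years stay dense, only the current one keeps changing
    """
    CREATE TABLE measurements_2001 PARTITION OF measurements
        FOR VALUES FROM ('2001-01-01') TO ('2002-01-01')
    """,
    """
    CREATE TABLE measurements_2002 PARTITION OF measurements
        FOR VALUES FROM ('2002-01-01') TO ('2003-01-01')
    """,
    # index supporting the "data between date A and date B" queries
    "CREATE INDEX ix_measurements_datetime ON measurements (datetime)",
]

with engine.begin() as conn:
    for statement in ddl:
        conn.execute(text(statement))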
Alternative-2: trick SQLAlchemy
This basically makes your sample code work by tricking SA into assuming that all those TimeSeries tables are children of one entity, using Concrete Table Inheritance. The code below is self-contained and creates 50 tables with minimal data in them. But if you already have a database, it should allow you to check the performance rather quickly, so that you can decide whether this is even a realistic possibility.
from datetime import date, datetime
from sqlalchemy import create_engine, Column, String, Integer, DateTime, Float, ForeignKey, func
from sqlalchemy.orm import sessionmaker, relationship, configure_mappers, joinedload
from sqlalchemy.ext.declarative import declarative_base, declared_attr
from sqlalchemy.ext.declarative import AbstractConcreteBase, ConcreteBase

engine = create_engine('sqlite:///:memory:', echo=True)
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base(engine)


# MODEL
class Location(Base):
    __tablename__ = 'locations'
    id = Column(Integer, primary_key=True)
    table_name = Column(String(50), unique=True)
    lon = Column(Float)
    lat = Column(Float)


class TSBase(AbstractConcreteBase, Base):
    @declared_attr
    def table_name(cls):
        return Column(String(50), ForeignKey('locations.table_name'))


def make_timeseries(name):
    class TimeSeries(TSBase):
        __tablename__ = name
        __mapper_args__ = {'polymorphic_identity': name, 'concrete': True}

        datetime = Column(DateTime, primary_key=True)
        value = Column(Float)

        def __init__(self, datetime, value, table_name=name):
            self.table_name = table_name
            self.datetime = datetime
            self.value = value

    return TimeSeries


def _test_model():
    _NUM = 50

    # 0. generate classes for all tables
    TS_list = [make_timeseries('ts{}'.format(1 + i)) for i in range(_NUM)]
    TS1, TS2, TS3 = TS_list[:3]  # just to have some named ones
    Base.metadata.create_all()
    print('-' * 80)

    # 1. configure mappers
    configure_mappers()

    # 2. define relationship
    Location.timeseries = relationship(TSBase, lazy="dynamic")
    print('-' * 80)

    # 3. add some test data
    session.add_all([Location(table_name='ts{}'.format(1 + i), lat=5 + i, lon=1 + i * 2)
                     for i in range(_NUM)])
    session.commit()
    print('-' * 80)

    session.add(TS1(datetime(2001, 1, 1, 3), 999))
    session.add(TS1(datetime(2001, 1, 2, 2), 1))
    session.add(TS2(datetime(2001, 1, 2, 8), 33))
    session.add(TS2(datetime(2002, 1, 2, 18, 50), -555))
    session.add(TS3(datetime(2005, 1, 3, 3, 33), 8))
    session.commit()

    # Query-1: get all timeseries of one Location
    # qs = session.query(Location).first()
    qs = session.query(Location).filter(Location.table_name == "ts1").first()
    print(qs)
    print(qs.timeseries.all())
    assert 2 == len(qs.timeseries.all())
    print('-' * 80)

    # Query-2: select all locations with data between date-A and date-B
    dateA, dateB = date(2001, 1, 1), date(2003, 12, 31)
    qs = (session.query(Location)
          .join(TSBase, Location.timeseries)
          .filter(TSBase.datetime >= dateA)
          .filter(TSBase.datetime <= dateB)
          ).all()
    print(qs)
    assert 2 == len(qs)
    print('-' * 80)

    # Query-3: select all data (including coordinates) for date A
    dateA = date(2001, 1, 1)
    qs = (session.query(Location.lat, Location.lon, TSBase.datetime, TSBase.value)
          .join(TSBase, Location.timeseries)
          .filter(func.date(TSBase.datetime) == dateA)
          ).all()
    print(qs)
    # note: qs is a list of tuples; easy to export to CSV
    assert 1 == len(qs)
    print('-' * 80)


if __name__ == '__main__':
    _test_model()
Alternative-3: a-la BigData
If you do get into performance problems using the database, I would probably try:
still keep the data in separate tables/databases/schemas like you do right now
bulk-import data using "native" solutions provided by your database engine
use MapReduce-like analysis.
Here I would stay with Python and SQLAlchemy and implement my own distributed query and aggregation (or find something existing). This obviously only works if you are not required to produce those results directly in the database; a rough sketch of the idea follows below.
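As a very rough sketch of that fan-out idea (assuming engine is bound to your Postgres database; the per-table query, the final aggregation, and the use of the locations table to list the per-location tables are all illustrative assumptions):

from concurrent.futures import ThreadPoolExecutor

from sqlalchemy import text


def max_value_in(table_name):
    # "map" step: run the same small query against one per-location table
    with engine.connect() as conn:
        return conn.execute(text('SELECT max(value) FROM "{}"'.format(table_name))).scalar()


with engine.connect() as conn:
    table_names = [row[0] for row in conn.execute(text("SELECT table_name FROM locations"))]

# query every table concurrently, then "reduce" the partial results in Python
with ThreadPoolExecutor(max_workers=8) as pool:
    per_table_max = list(pool.map(max_value_in, table_names))

overall_max = max(v for v in per_table_max if v is not None)
print(overall_max)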
edit-1: Alternative-4: TimeSeries databases
I have no experience using those at a large scale, but they are definitely an option worth considering.
It would be fantastic if you could later share your findings and the whole decision-making process.
I would avoid the database design you mention above. I don't know enough about the data you are working with, but it sounds like you should have two tables: one table for location, and a child table for location_data. The location table would store the data you mention above, such as coordinates and elevation. The location_data table would store the location_id from the location table as well as the time series data you want to track.
This would eliminate the database structure and code changes every time you add another location, and would allow the types of queries you are looking at doing; a sketch of the layout follows below.
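A minimal sketch of that two-table layout, with columns assumed from the description above (the exact names, LocationData included, are illustrative):

from sqlalchemy import Column, DateTime, Float, ForeignKey, Index, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()


class Location(Base):
    __tablename__ = 'location'
    id = Column(Integer, primary_key=True)
    lat = Column(Float)
    lon = Column(Float)
    elevation = Column(Float)
    data = relationship('LocationData', backref='location')


class LocationData(Base):
    __tablename__ = 'location_data'
    id = Column(Integer, primary_key=True)
    location_id = Column(Integer, ForeignKey('location.id'), nullable=False)
    datetime = Column(DateTime, nullable=False)
    value = Column(Float)


# composite index so "all data for location X between date A and date B" stays cheap
Index('ix_location_data_location_datetime', LocationData.location_id, LocationData.datetime)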
Two parts:
Only use two tables
There's no need to have dozens or hundreds of identical tables. Just have a table for location and one for location_data, where every entry has a foreign key onto location. Also create an index on the location_data table for the location_id, so you have efficient searching.
Don't use SQLAlchemy to create this
I love SQLAlchemy. I use it every day. It's great for managing your database and adding some rows, but you don't want to use it for an initial load of millions of rows. You want to generate a file that is compatible with Postgres' COPY statement [http://www.postgresql.org/docs/9.2/static/sql-copy.html]. COPY will let you pull in a ton of data fast; it's what is used during dump/restore operations.
SQLAlchemy will be great for querying this and adding rows as they come in. For bulk operations, use COPY; a sketch follows below.
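For example, a sketch of bulk-loading through COPY with psycopg2 (the table and column names are assumptions matching the two-table layout above; you can also obtain the raw DBAPI connection from an SQLAlchemy engine via engine.raw_connection()):

import csv
import io

import psycopg2  # assuming psycopg2 as the Postgres driver


def copy_rows(conn, rows):
    """Bulk-load (location_id, datetime, value) tuples via COPY ... FROM STDIN."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert(
            "COPY location_data (location_id, datetime, value) FROM STDIN WITH (FORMAT csv)",
            buf,
        )
    conn.commit()


# usage, with a made-up DSN:
# conn = psycopg2.connect("dbname=location_test user=user password=pw host=localhost")
# copy_rows(conn, [(1, '2001-01-01 00:00:00', 999.0), (1, '2001-01-01 01:00:00', -555.0)])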
Related
I am building a system for experiment data organization. What is the most efficient way to insert new entries into a table that has a many-to-many relationship with another table?
Simple example
I have two tables: Subject & Task. Subjects can run multiple Tasks (experiments) and Tasks can be run by many Subjects. They are associated by an association table r_Subject_Task.
Questions
How would I insert multiple entries at a time into Subject that share the same Task, when that Task does not already exist in the task table?
Same as Q1, but when the Task does already exist in the task table?
How can I add a Task that relates to an existing Subject?
Code Example
Setup engine and classes
import sqlalchemy as db
from sqlalchemy import Column, ForeignKey, Integer, String, Table
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()
engine = db.create_engine('sqlite:///test.db')

r_Subject_Task = Table(
    'r_subject_task',
    Base.metadata,
    Column('subject_id', ForeignKey('subject.id'), primary_key=True),
    Column('task_id', ForeignKey('task.id'), primary_key=True)
)


class Subject(Base):
    __tablename__ = 'subject'
    id = Column(String(30), primary_key=True)
    tasks = relationship("Task", secondary=r_Subject_Task, back_populates='subjects')

    def __repr__(self):
        return f'Subject(id = {self.id}), tasks = {self.tasks}'


class Task(Base):
    __tablename__ = 'task'
    id = Column(String(30), primary_key=True)
    subjects = relationship("Subject", secondary=r_Subject_Task, back_populates='tasks')

    def __repr__(self):
        return f'Task(id = {self.id}, subjects = {self.subjects})'


Base.metadata.create_all(engine)
Great, the tables are set up. Now I'd like to insert data.
from sqlalchemy.orm import Session

with Session(engine) as session:
    subject001 = Subject(
        id='subj001',
        tasks=[Task(id='task1')]
    )
    subject002 = Subject(
        id='subj002',
        tasks=[Task(id='task1'), Task(id='task2')]
    )
    subject003 = Subject(
        id='subj003'
    )
    task3 = Task(
        id='task3',
        subjects=[Subject(id='subj003')]
    )
    session.add_all([subject001, subject002, subject003, task3])
    session.commit()
This results in two errors.
Error1: committing multiple subjects with matching tasks. When I try to create multiple Subjects with matching Tasks, the UNIQUE constraint fails; I am trying to pass in multiple instances of 'task1'.
Error2: committing a task with an existing subject. When I try to create Task 'task3' and apply it to 'subj003', the UNIQUE constraint fails; I am trying to add a subject that already exists.
What is the most efficient way to insert many-to-many data dynamically? Should I go one by one? This seems like a standard use case but I'm having difficulty finding good resources in the docs. Any links to relevant tutorials or examples would be much appreciated.
Thanks
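One way to avoid both errors, sketched with the classes above and SQLAlchemy 1.4+'s session.get() (one common pattern, not the only option): build each Task object exactly once and reuse it across subjects, and load already-persisted rows from the session instead of constructing new objects with the same primary key.

from sqlalchemy.orm import Session

with Session(engine) as session:
    task1 = Task(id='task1')
    task2 = Task(id='task2')

    # reuse the same Task instances instead of creating duplicates
    subject001 = Subject(id='subj001', tasks=[task1])
    subject002 = Subject(id='subj002', tasks=[task1, task2])
    subject003 = Subject(id='subj003')
    session.add_all([subject001, subject002, subject003])
    session.commit()

    # relate a new Task to an existing Subject by loading the subject first
    existing = session.get(Subject, 'subj003')
    task3 = Task(id='task3', subjects=[existing])
    session.add(task3)
    session.commit()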
With a SQLAlchemy query like:
result = db.session.query(Model).add_columns(
    func.min(Model.foo).over().label("min_foo"),
    func.max(Model.foo).over().label("max_foo"),
    # ...
)
The result is an iterable of tuples consisting of, first, the Model row, and then the added columns.
How can I either:
Contribute the added columns to Model, such that they can be accessed from each element as model.min_foo et al.; or
Map the added columns into a separate dataclass, such that they can be accessed as e.g. extra.min_foo?
The main thing I'm trying to achieve here is access by name - such as the given labels - without enumerating them all as model, min_foo, max_foo, ... and relying on maintaining the same order. With model, *extra, extra is just a plain list of the aggregate values; there's no reference to the label.
If I dynamically add the columns to the model first:
Model.min_foo = Column(Numeric)
then it complains:
Implicitly combining column modeltable.min_foo with column modeltable.min_foo under attribute 'min_foo'.
Please configure one or more attributes for these same-named columns explicitly
Apparently the solution to that is to explicitly join the tables, but there is no second table here!
It seems that this ought to be possible with 'mappers', but I can't find any examples that don't explicitly map to a 'table name' or its columns, which I don't really have here - it's not clear to me if/how they can be used with aggregates, or other 'virtual' columns from the query that aren't actually stored in any table.
I think what you are looking for is query-time SQL expressions as mapped attributes:
from sqlalchemy import create_engine, Column, Integer, select, func
from sqlalchemy.orm import (Session, declarative_base, query_expression,
                            with_expression)

Base = declarative_base()


class Model(Base):
    __tablename__ = 'model'
    id = Column(Integer, primary_key=True)
    foo = Column(Integer)
    foo2 = Column(Integer, default=0)


engine = create_engine('sqlite:///', future=True)
Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Model(foo=10))
    session.add(Model(foo=20))
    session.add(Model(foo=30))
    session.add(Model(foo=40))
    session.add(Model(foo=50, foo2=1))
    session.add(Model(foo=60, foo2=1))
    session.add(Model(foo=70, foo2=1))
    session.add(Model(foo=80))
    session.add(Model(foo=90))
    session.add(Model(foo=100))
    session.commit()

    Model.min_foo = query_expression(func.min(Model.foo).over())
    stmt = select(Model).where(Model.foo2 == 1)
    models = session.execute(stmt).all()
    for model, in models:
        print(model.min_foo)

with Session(engine) as session:
    Model.max_foo = query_expression()
    stmt = select(Model).options(with_expression(Model.max_foo,
                                                 func.max(Model.foo).over())
                                 ).where(Model.foo2 == 1)
    models = session.execute(stmt).all()
    for model, in models:
        print(model.max_foo)
You can define a default expression when declaring the query_expression, or you can define a runtime expression using .options with with_expression. The only caveat is that the mapped attribute cannot be unmapped, and max_foo will return None in any query where no expression is supplied via with_expression, since it has no default expression defined.
I have a table of time series data, and I frequently need to get records where the date is equal to the max date in the table. In SQL this is easily accomplished via a subquery, i.e.:
SELECT * from my_table where date = (select max(date) from my_table);
The model for this table would look like:
class MyTable(Base):
    __tablename__ = 'my_table'
    id = Column(Integer, primary_key=True)
    date = Column(Date)
And I can accomplish the desired behavior in SQLAlchemy with two separate queries, i.e.:
maxdate = session.query(func.max(MyTable.date)).first()[0]
desired_results = session.query(MyTable).filter(MyTable.date == maxdate).all()
The problem is that I have this subquery sprinkled everywhere in my code and I feel it is an inelegant solution. Ideally I would like to write a class property or custom comparator that I can stick in the model definition, so that I can compress the subquery into a single line and reuse it constantly, something like:
session.query(MyTable).filter(MyTable.date == MyTable.max_date)
I have looked through the SQLAlchemy docs on this but haven't come up with anything that works. Does anybody have a neat solution for this kind of problem?
For posterity, here is the solution I came up with
from sqlalchemy.sql import func
from sqlalchemy import select


class MyTable(Base):
    __tablename__ = 'my_table'
    id = Column(Integer, primary_key=True)
    date = Column(Date)
    maxdate = select([func.max(date)])


desired_results = session.query(MyTable).filter(MyTable.date == MyTable.maxdate).all()
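An alternative sketch, assuming SQLAlchemy 1.4+, maps the subquery explicitly as a read-only attribute with column_property; correlate(None) keeps the subquery's FROM my_table intact even when it is embedded in a query against the same table, and scalar_subquery() replaces the legacy select([...]) form used above:

from sqlalchemy import Column, Date, Integer, func, select
from sqlalchemy.orm import column_property, declarative_base

Base = declarative_base()


class MyTable(Base):
    __tablename__ = 'my_table'
    id = Column(Integer, primary_key=True)
    date = Column(Date)
    # read-only mapped attribute: the table-wide max date as a scalar subquery
    max_date = column_property(
        select(func.max(date)).correlate(None).scalar_subquery()
    )


# usage: rows whose date equals the current maximum
# latest = session.query(MyTable).filter(MyTable.date == MyTable.max_date).all()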
Maybe my previous question was too long and open-ended to answer, sorry for that... I will try to be more specific and shorten it.
I can extract from an API query (JSON output) the following information:
GENE1
Experiment1
Experiment2
Experiment3
Experiment4
GENE2
Experiment5
Experiment2
Experiment3
Experiment8
Experiment9
[...]
So I obtain genes and the experiments in which they have been studied... One gene can have more than one experiment, and one experiment can include more than one gene (many-to-many).
I have this schema in SQLAlchemy:
from sqlalchemy import create_engine, Column, Integer, String, Date, ForeignKey, Table, Float
from sqlalchemy.orm import sessionmaker, relationship, backref
from sqlalchemy.ext.declarative import declarative_base
import requests

Base = declarative_base()

Genes2experiments = Table('genes2experiments', Base.metadata,
                          Column('gene_id', String, ForeignKey('genes.id')),
                          Column('experiment_id', String, ForeignKey('experiments.id'))
                          )


class Genes(Base):
    __tablename__ = 'genes'
    id = Column(String(45), primary_key=True)
    experiments = relationship("Experiments", secondary=Genes2experiments, backref="genes")

    def __init__(self, id=""):
        self.id = id

    def __repr__(self):
        return "<genes(id:'%s')>" % (self.id)


class Experiments(Base):
    __tablename__ = 'experiments'
    id = Column(String(45), primary_key=True)

    def __init__(self, id=""):
        self.id = id

    def __repr__(self):
        return "<experiments(id:'%s')>" % (self.id)


def setUp():
    global Session
    engine = create_engine('mysql://root:password@localhost/db_name?charset=utf8',
                           pool_recycle=3600, echo=False)
    Session = sessionmaker(bind=engine)


def add_data():
    session = Session()
    for i in range(0, 1000, 200):
        request = requests.get('http://www.ebi.ac.uk/gxa/api/v1',
                               params={"updownInOrganism_part": "brain", "rows": 200, "start": i})
        result = request.json()
        for item in result['results']:
            gene_to_add = item['gene']['ensemblGeneId']
            # merge adds the gene, or reuses the existing row if the id is already present
            session.merge(Genes(id=gene_to_add))
        session.commit()
    session.close()


setUp()
add_data()
With this code I just add all the genes from the API query to the Genes table in my database...
1st question: how and when should I add the experiment information so as to keep the relationship between genes and experiments?
2nd question: should I add a new secondary relationship in the Experiments class, as in the Genes class, or is it enough to define just one?
Thank you
(for more context/info: my previous question)
Whenever you record the results of an experiment, or even when you plan one, you can already add the instances to the database, along with their relationships.
Having backref will effectively add the other side of the relationship, so that given an instance of Experiments you can get its genes via my_experiment.genes.
Note: I would remove the plural from your entity names: class Gene and class Experiment instead of class Genes and class Experiments.
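For illustration, a small sketch using the models from the question (assuming setUp() has been called so Session is configured):

session = Session()

gene = Genes("GENE1")
exp1 = Experiments("Experiment1")
exp2 = Experiments("Experiment2")

# appending to one side fills the genes2experiments association table...
gene.experiments.extend([exp1, exp2])
session.add(gene)
session.commit()

# ...and the backref gives you the reverse direction for free
print(exp1.genes)        # [<genes(id:'GENE1')>]
print(gene.experiments)  # [<experiments(id:'Experiment1')>, <experiments(id:'Experiment2')>]

session.close()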
I am a relative newcomer to SQLAlchemy and have read the basic docs. I'm currently following Mike Driscoll's MediaLocker tutorial and modifying/extending it for my own purpose.
I have three tables (loans, people, cards). Card to Loan and Person to Loan are both one-to-many relationships and modelled as such:
from sqlalchemy import Table, Column, Boolean, DateTime, Integer, ForeignKey, Unicode
from sqlalchemy.orm import backref, relation
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base

engine = create_engine("sqlite:///cardsys.db", echo=True)
DeclarativeBase = declarative_base(engine)
metadata = DeclarativeBase.metadata


class Loan(DeclarativeBase):
    """
    Loan model
    """
    __tablename__ = "loans"

    id = Column(Integer, primary_key=True)
    card_id = Column(Unicode, ForeignKey("cards.id"))
    person_id = Column(Unicode, ForeignKey("people.id"))
    date_issued = Column(DateTime)
    date_due = Column(DateTime)
    date_returned = Column(DateTime)
    issue_reason = Column(Unicode(50))

    person = relation("Person", backref="loans", cascade_backrefs=False)
    card = relation("Card", backref="loans", cascade_backrefs=False)


class Card(DeclarativeBase):
    """
    Card model
    """
    __tablename__ = "cards"

    id = Column(Unicode(50), primary_key=True)
    active = Column(Boolean)


class Person(DeclarativeBase):
    """
    Person model
    """
    __tablename__ = "people"

    id = Column(Unicode(50), primary_key=True)
    fname = Column(Unicode(50))
    sname = Column(Unicode(50))
When I try to create a new loan (using the below method in my controller) it works fine for unique cards and people, but once I try to add a second loan for a particular person or card it gives me a "non-unique" error. Obviously it's not unique, that's the point, but I thought SQLAlchemy would take care of the behind-the-scenes stuff for me and add the correct existing person or card id as the FK in the new loan, rather than trying to create new person and card records. Is it up to me to query the db to check PK uniqueness and handle this manually? I got the impression this should be something SQLAlchemy might be able to handle automatically.
def addLoan(session, data):
    loan = Loan()
    loan.date_due = data["loan"]["date_due"]
    loan.date_issued = data["loan"]["date_issued"]
    loan.issue_reason = data["loan"]["issue_reason"]

    person = Person()
    person.id = data["person"]["id"]
    person.fname = data["person"]["fname"]
    person.sname = data["person"]["sname"]
    loan.person = person

    card = Card()
    card.id = data["card"]["id"]
    loan.card = card

    session.add(loan)
    session.commit()
In the MediaLocker example new rows are created with an auto-increment PK (even for duplicates, which does not conform to normalisation rules). I want a normalised database (even in a small project, just as good practice while learning) but can't find any examples online to study.
How can I achieve the above?
It's up to you to retrieve and assign the existing Person or Card object to the relationship before attempting to add a new one with a duplicate primary key. You can do this with a couple of small changes to your code.
def addLoan(session, data):
    loan = Loan()
    loan.date_due = data["loan"]["date_due"]
    loan.date_issued = data["loan"]["date_issued"]
    loan.issue_reason = data["loan"]["issue_reason"]

    person = session.query(Person).get(data["person"]["id"])
    if not person:
        person = Person()
        person.id = data["person"]["id"]
        person.fname = data["person"]["fname"]
        person.sname = data["person"]["sname"]
    loan.person = person

    card = session.query(Card).get(data["card"]["id"])
    if not card:
        card = Card()
        card.id = data["card"]["id"]
    loan.card = card

    session.add(loan)
    session.commit()
There are also various get_or_create recipes, if you want to wrap this into one step, along the lines of the sketch below.
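A minimal get_or_create sketch in that spirit (the helper name and keyword handling are illustrative, not from any particular library):

def get_or_create(session, model, **kwargs):
    # look the row up by its attributes; create and stage it if it is missing
    instance = session.query(model).filter_by(**kwargs).first()
    if instance is None:
        instance = model(**kwargs)
        session.add(instance)
    return instance


# usage inside addLoan():
# loan.person = get_or_create(session, Person, id=data["person"]["id"])
# loan.card = get_or_create(session, Card, id=data["card"]["id"])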
If you're loading large numbers of records into a new database from scratch, and your query is more complex than a get (the session object is supposed to cache get lookups on its own), you could avoid the queries altogether at the cost of memory by adding each new Person and Card object to a temporary dict by ID, and retrieving the existing objects there instead of hitting the database.