I set up the column names in the class like below:
class Stat1(Base):
    __tablename__ = 'stat1'
    __table_args__ = {'sqlite_autoincrement': True}

    id = Column(VARCHAR, primary_key=True, nullable=False)
    Date_and_Time = Column(VARCHAR)
    IP_Address = Column(VARCHAR)
    Visitor_Label = Column(VARCHAR)
    Browser = Column(VARCHAR)
    Version = Column(VARCHAR)
The CSV file does not use underscores in its column names. It is a file downloaded from the internet. For instance, when I import it, column headers like "Date_and_Time" come in as "Date and Time".
I had assumed (wrongly, it seems) that the CSV's column names would map to the column attributes I set up in the class, but that is not happening, and the queries fail because of it. I am getting messages like this:
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such
column: stat1.Date_and_Time [SQL: 'SELECT stat1.id AS stat1_id,
stat1."Date_and_Time" AS "stat1_Date_and_Time", stat1."IP_Address" AS
"stat1_IP_Address"...etc.
Is there a way to map these automatically so that queries succeed? Or a way to automatically insert underscores into the CSV's column headings so they match the columns defined in the class?
There are a couple of different ways that you can approach this:
Implement Your Own De-serialization Logic
This means that you read your CSV file and map its columns to your Base model class's attributes yourself, in your own custom code (much as in your question).
I think, in this scenario, having underscores in your model class attributes (Stat1.Date_and_Time) but not in your CSV header (...,"Date and Time",...) will complicate your code a bit. However, depending on how you've implemented your mapping code, you can set your Column to use one model attribute name (Stat1.Date_and_Time)
and a different database column name (e.g. have Stat1.Date_and_Time map to your database column "Date and Time"). To accomplish this, you need to pass the name argument as below:
class Stat1(Base):
    __tablename__ = 'stat1'
    __table_args__ = {'sqlite_autoincrement': True}

    id = Column(name='id', type_=VARCHAR, primary_key=True, nullable=False)
    Date_and_Time = Column(name='Date and Time', type_=VARCHAR)
    IP_Address = Column(name='IP Address', type_=VARCHAR)
    # etc.
Now when you read records from your CSV file, you will need to load them into the appropriate model attributes in your Stat1 class. A pseudo-code example would be:
id, date_and_time, ip_address = read_csv_record(csv_record)
# Let's assume the read_csv_record() function reads your CSV record and returns
# the appropriate values for id, Date_and_Time, and IP_Address.
my_record = Stat1(id=id,
                  Date_and_Time=date_and_time,
                  IP_Address=ip_address,
                  # etc.
                  )
Here, the trick is to implement your read_csv_record() function so that it reads and returns the column values for your model attributes, which you can then pass to your Stat1() constructor; one possible shape is sketched below.
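As a purely illustrative sketch (not from the original answer), the helper below reads the whole CSV with csv.DictReader and renames the spaced headers to the underscored attribute names. The file name, the generator-per-file shape (rather than one call per record), and the commented usage with session are assumptions:

import csv

def read_csv_records(path):
    """Yield one dict per CSV row, with headers like 'Date and Time'
    renamed to attribute names like 'Date_and_Time'."""
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            yield {header.replace(' ', '_'): value for header, value in row.items()}

# Hypothetical usage:
# for record in read_csv_records('your_csv_file.csv'):
#     session.add(Stat1(**record))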
Use SQLAthanor
An (I think easier) alternative to implementing your own de-serialization solution is to use a library like SQLAthanor (full disclosure: I'm the library's author, so I'm a bit biased). Using SQLAthanor, you can either:
Create your Stat1 model class programmatically:
from sqlathanor import generate_model_from_csv
Stat1 = generate_model_from_csv('your_csv_file.csv',
                                'stat1',
                                primary_key='id')
Please note, however, that if your column header names are not ANSI SQL standard column names (if they contain spaces, for example), this will likely produce an error.
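One way around that (an addition for illustration, not part of the original answer) is to rewrite the CSV's header row first, replacing spaces with underscores, and then hand the cleaned file to generate_model_from_csv(); the file names below are placeholders:

import csv

with open('your_csv_file.csv', newline='') as src, \
        open('your_csv_file_clean.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)
    # "Date and Time" -> "Date_and_Time", etc.
    writer.writerow([column.replace(' ', '_') for column in header])
    writer.writerows(reader)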
Define your model, and then create instances from your CSV.
To do this, you would define your model very similarly to how you do above:
from sqlathanor import BaseModel

class Stat1(BaseModel):
    __tablename__ = 'stat1'
    __table_args__ = {'sqlite_autoincrement': True}

    id = Column(name='id', type_=VARCHAR, primary_key=True, nullable=False,
                supports_csv=True, csv_sequence=1)
    Date_and_Time = Column(name='Date and Time', type_=VARCHAR,
                           supports_csv=True, csv_sequence=2)
    IP_Address = Column(name='IP Address', type_=VARCHAR,
                        supports_csv=True, csv_sequence=3)
    # etc.
The supports_csv argument tells your Stat1 class that model attribute Stat1.id can be de-serialized from (and serialized to) CSV, and the csv_sequence argument indicates that it will always be the first column in a CSV record.
Now you can create a new Stat1 instance (a record in your database) by passing your CSV record to Stat1.new_from_csv():
# let's assume you have loaded a single CSV record into a variable "csv_record"
my_record = Stat1.new_from_csv(csv_record)
and that's it! Now your my_record variable will contain an object representation of your CSV record, which you can then commit to the database if and when you choose. Since there is a wide variety of ways that CSV files can be constructed (using different delimiters, wrapping strategies, etc.), there are a large number of configuration arguments that can be supplied to .new_from_csv(), but you can find all of them documented here: https://sqlathanor.readthedocs.io/en/latest/using.html#new-from-csv
SQLAthanor is an extremely robust library for moving data into / out of CSV and SQLAlchemy, so I strongly recommend you review the documentation. Here are the important links:
Github Repo
Comprehensive Documentation
PyPi
Hope this helps!
Related
I am learning some SQL from the book Essential SQLAlchemy by Rick Copeland.
I have not really used SQL much, and have relied on frameworks like Pandas and Dask for data-processing tasks. As I go through the book, I realise that all the column names of a table are part of the table's class attributes, and hence it seems they need to be hard-coded instead of being dealt with as strings. Here is an example from the book:
#!/usr/bin/env python3
# encoding: utf-8

class LineItem(Base):
    __tablename__ = 'line_items'
    line_item_id = Column(Integer(), primary_key=True)
    order_id = Column(Integer(), ForeignKey('orders.order_id'))
    cookie_id = Column(Integer(), ForeignKey('cookies.cookie_id'))
    quantity = Column(Integer())
    extended_cost = Column(Numeric(12, 2))

    order = relationship("Order", backref=backref('line_items',
                                                  order_by=line_item_id))
    cookie = relationship("Cookie", uselist=False)
When I work with a Pandas DataFrame, I usually deal with it like this:
#!/usr/bin/env python3
# encoding: utf-8

col_name: str = 'cookie_id'
df[col_name]  # To access a column
Is there any way to make the column names in SQLAlchemy dynamic, i.e. have them represented (and added to the table) purely as strings, and have tables created dynamically with different column names (the strings coming from some other function, or even user input), so that I can later access them with strings as well?
Or is my expectation wrong, in the sense that SQL is simply not supposed to be used like that?
It's possible to use SQLAlchemy like this*, but it's probably much more convenient to use a tool like Pandas which abstracts away all the work of defining columns and tables. If you want to try you should study the mapping documentation, which covers how model classes get mapped to database tables in detail.
Here's a fairly simple, not production-quality, example of how you might create some models and populate and query them dynamically.
import operator

import sqlalchemy as sa
from sqlalchemy import orm

metadata = sa.MetaData()

# Define some table schema data.
tables = [
    {
        'name': 'users',
        'columns': [('id', sa.Integer, True), ('name', sa.String, False)],
    },
]

# Create some table definitions in SQLAlchemy (not in the database, yet).
for t in tables:
    tbl = sa.Table(
        t['name'],
        metadata,
        *[
            sa.Column(name, type_, primary_key=pk)
            for (name, type_, pk) in t['columns']
        ],
    )

# Create some empty classes for our model(s).
classes = {t['name']: type(t['name'].title(), (object,), {}) for t in tables}

# Map model classes to tables.
mapper_registry = orm.registry(metadata=metadata)
for c in classes:
    mapper_registry.map_imperatively(classes[c], metadata.tables[c.lower()])

# For convenience, make our model class a local variable.
User = classes['users']

engine = sa.create_engine('sqlite://', echo=True, future=True)

# Actually create our tables in the database.
metadata.create_all(engine)

Session = orm.sessionmaker(engine, future=True)

# Demonstrate we can insert and read data using our model.
user_data = [{'name': n} for n in ['Alice', 'Bob']]
with Session.begin() as s:
    s.add_all([User(**d) for d in user_data])

# Mock user input for a simple query.
query_data = {
    'column': 'name',
    'op': 'eq',
    'value': 'Alice',
}

with Session() as s:
    equals = getattr(operator, query_data['op'])
    where = equals(getattr(User, query_data['column']), query_data['value'])
    q = sa.select(User).where(where)
    users = s.scalars(q)
    for user in users:
        print(user.id, user.name)
* Up to a point: SQLAlchemy generally does not support schema changes other than creating and deleting tables.
I am using the declarative base in SQLAlchemy to query data:
import sqlalchemy as sql
from sqlalchemy.ext.declarative import declarative_base

OdsBase = declarative_base(metadata=sql.MetaData(schema='ods'))

class BagImport(OdsBase):
    __tablename__ = 'bag_stg_import'
    __table_args__ = {'extend_existing': True}

    PAN = sql.Column(sql.String(50), primary_key=True)
    GEM = sql.Column(sql.String(50))

    def __repr__(self):
        return "<{0} Pan: {1} - Gem: {2}>".format(self.__class__.__name__, self.PAN, self.GEM)
If I run the following, I get a proper result:
my_session.query(BagImport).first()
But if I want to see the query and I do:
the_query = my_session.query(BagImport)
print(the_query)
I get the output query as:
SELECT ods.bag_stg_import."PAN" AS "ods_bag_stg_import_PAN_1", ods.bag_stg_import."GEM" AS "ods_bag_stg_import_GEM_2"
FROM ods.bag_stg_import
Why is SQLAlchemy prefixing the table name in the column alias, e.g. SELECT ods.bag_stg_import."PAN" AS "ods_bag_stg_import_PAN_1"?
How can I get SELECT ods.bag_stg_import."PAN" AS "PAN" instead?
I've figured out a way to do this. In my case, I was having issues with except_, which prefixes columns with the table name. Here's how I did that:
def _except(included_query, excluded_query, Model, prefix):
    """An SQLAlchemy except_ that removes the prefixes on the columns, so they can be
    referenced in a subquery by their un-prefixed names."""
    query = included_query.except_(excluded_query)
    subquery = query.subquery()

    # Get a list of columns from the subquery, relabeled with the simple column name.
    columns = []
    for column_name in _attribute_names(Model):
        column = getattr(subquery.c, prefix + column_name)
        columns.append(column.label(column_name))

    # Wrap the query to select the simple column names. This is necessary because
    # except_ prefixes column names with a string derived from the table name.
    return Model.query.from_statement(Model.query.with_entities(*columns).statement)
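For completeness, a hypothetical usage of this helper might look like the sketch below; the model, its columns, and the prefix string are placeholders (the prefix being whatever string except_ derived from your table name), and _attribute_names() is assumed to be a helper that returns the model's column attribute names:

# Hypothetical usage with Flask-SQLAlchemy query objects.
included = Thing.query.filter(Thing.status == 'new')
excluded = Thing.query.filter(Thing.flagged.is_(True))

deduped = _except(included, excluded, Thing, prefix='thing_').all()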
Old answer:
This happened to me when using the except_ query method like this:
included_query.except_(excluded_query)
I fixed it by changing to this pattern:
excluded_subquery = excluded_query.with_entities(ModelClass.id).subquery()
included_query.filter(ModelClass.id.notin_(excluded_subquery))
Just set the __tablename__ property in your model:
__tablename__ = 'Users'
Then SQLAlchemy will use the proper table name.
This has been a tricky one; I hope someone out there can help us all out by posting a method of creating an index on a nested key of a JSON (or JSONB) column in PostgreSQL using SQLAlchemy (I'm specifically using Flask-SQLAlchemy, but I do not think that will matter much for the answer).
I've tried all sorts of permutations of the index creations below and get everything from key errors, to "'c' is not an attribute", to "the operator 'getitem' is not supported on this expression".
Any help would be greatly appreciated.
# Example JSON; the nested property is "level2_A"
{
    'level1': {
        'level2_A': 'test value',
    }
}

class TestThing(db.Model):
    __tablename__ = 'test_thing'
    id = db.Column(db.BigInteger(), primary_key=True)
    data = db.Column(JSONB)
    __table_args__ = (db.Index('ix_1', TestThing.data['level1']['level2_A']),
                      db.Index('ix_2', data['level1']['level2_A'].astext),
                      db.Index('ix_3', "TestThing.c.data['level1']['level2_A'].astext"),
                      db.Index('ix_4', TestThing.c.data['level1']['level2_A'].astext),
                      db.Index('ix_5', "test_thing.c.data['level1']['level2_A']"),
                      )
# db.Index('ix_1', TestThing.data['level1']['level2_A'])
# db.Index('ix_2_t', "test_thing.data['level1']['level2_A']")
# db.Index('ix_3', "TestThing.c.data['level1']['level2_A'].astext")
# db.Index('ix_4', TestThing.c.data['level1']['level2_A'].astext)
# db.Index('ix_5', "test_thing.c.data['level1']['level2_A']")
The solution I've found is to use text() to create a functional index.
Here are two example indexes, depending on whether or not you want to cast the result to text:
from sqlalchemy.sql.expression import text
from sqlalchemy.schema import Index

class TestThing(db.Model):
    __tablename__ = 'test_thing'
    id = db.Column(db.BigInteger(), primary_key=True)
    data = db.Column(JSONB)
    __table_args__ = (
        Index("ix_6", text("(data->'level1'->'level2_A')")),
        Index("ix_7", text("(data->'level1'->>'level2_A')")),
    )
Which results in the following SQL to create the indexes:
CREATE INDEX ix_6 ON test_thing(((data -> 'level1'::text) -> 'level2_A'::text) jsonb_ops);
CREATE INDEX ix_7 ON test_thing(((data -> 'level1'::text) ->> 'level2_A'::text) text_ops);
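As a usage note (an addition, not part of the original answer): a filter written with SQLAlchemy's JSONB operators renders the same (data -> 'level1' ->> 'level2_A') expression as ix_7, so PostgreSQL can consider that index for a query such as the sketch below; db.session and the filter value are illustrative.

matches = (
    db.session.query(TestThing)
    .filter(TestThing.data['level1']['level2_A'].astext == 'test value')
    .all()
)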
class TestThing(db.Model):
    __tablename__ = 'test_thing'
    id = db.Column(db.BigInteger(), primary_key=True)
    data = db.Column(JSONB)
    __table_args__ = (
        Index("ix_7", "(data->'level1'->>'level2_A')"),
    )
This should ideally work without the text(), because -> returns json(b) and ->> returns text.
The generated SQL will be:
CREATE INDEX ix_7 ON test_thing(((data->'level1')->>'level2_A') text_ops);
I have many (~2000) locations with time series data. Each time series has millions of rows. I would like to store these in a Postgres database. My current approach is to have a table for each location's time series, and a meta table which stores information about each location (coordinates, elevation etc). I am using Python/SQLAlchemy to create and populate the tables.
I would like to have a relationship between the meta table and each time series table, to do queries like "select all locations that have data between date A and date B" and "select all data for date A and export a csv with coordinates".
What is the best way to create many tables with the same structure (only the name is different) and have a relationship with a meta table? Or should I use a different database design?
Currently I am using this type of approach to generate a lot of similar mappings:
from sqlalchemy import create_engine, MetaData
from sqlalchemy.types import Float, String, DateTime, Integer
from sqlalchemy import Column, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship, backref

Base = declarative_base()


def make_timeseries(name):
    class TimeSeries(Base):
        __tablename__ = name
        table_name = Column(String(50), ForeignKey('locations.table_name'))
        datetime = Column(DateTime, primary_key=True)
        value = Column(Float)

        location = relationship('Location', backref=backref('timeseries',
                                                             lazy='dynamic'))

        def __init__(self, table_name, datetime, value):
            self.table_name = table_name
            self.datetime = datetime
            self.value = value

        def __repr__(self):
            return "{}: {}".format(self.datetime, self.value)

    return TimeSeries


class Location(Base):
    __tablename__ = 'locations'
    id = Column(Integer, primary_key=True)
    table_name = Column(String(50), unique=True)
    lon = Column(Float)
    lat = Column(Float)


if __name__ == '__main__':
    connection_string = 'postgresql://user:pw@localhost/location_test'
    engine = create_engine(connection_string)
    metadata = MetaData(bind=engine)
    Session = sessionmaker(bind=engine)
    session = Session()

    TS1 = make_timeseries('ts1')
    # TS2 = make_timeseries('ts2')  # this breaks because of the foreign key
    Base.metadata.create_all(engine)
    session.add(TS1("ts1", "2001-01-01", 999))
    session.add(TS1("ts1", "2001-01-02", -555))

    qs = session.query(Location).first()
    print(qs.timeseries.all())
This approach has some problems, most notably that if I create more than one TimeSeries, the foreign key doesn't work. Previously I've used some workarounds, but it all seems like a big hack, and I feel that there must be a better way of doing this. How should I organise and access my data?
Alternative-1: Table Partitioning
Partitioning immediately comes to mind as soon as I read "exactly the same table structure". I am not a DBA and do not have much production experience using it (even more so on PostgreSQL), but
please read the PostgreSQL partitioning documentation. Table partitioning seeks to solve exactly the problem you have, but over 1K tables/partitions sounds challenging; therefore please do more research on forums/SO for scalability-related questions on this topic.
Given that the datetime component is very important in both of your most-used search criteria, there must be a solid indexing strategy on it. If you decide to go down the partitioning route, the obvious partitioning strategy would be based on date ranges. This might allow you to partition older data in different chunks compared to the most recent data, especially since old data is (almost) never updated, so the physical layouts would be dense and efficient, while you could employ another strategy for more "recent" data.
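To make the idea concrete, here is a minimal, illustrative sketch of date-range partitioning, assuming PostgreSQL 10+ declarative partitioning (older versions use inheritance-based partitioning instead); the table, column, and connection details are placeholders, not part of the original answer:

import sqlalchemy as sa

engine = sa.create_engine('postgresql://user:pw@localhost/location_test')

statements = [
    # Parent table, partitioned by the timestamp column.
    """CREATE TABLE location_data (
           location_id INTEGER NOT NULL,
           datetime    TIMESTAMP NOT NULL,
           value       DOUBLE PRECISION
       ) PARTITION BY RANGE (datetime)""",
    # One partition per year (repeat for other ranges as needed).
    """CREATE TABLE location_data_2001 PARTITION OF location_data
           FOR VALUES FROM ('2001-01-01') TO ('2002-01-01')""",
    # Index the partition on the search column.
    """CREATE INDEX ix_location_data_2001_datetime
           ON location_data_2001 (datetime)""",
]

with engine.begin() as conn:
    for stmt in statements:
        conn.execute(sa.text(stmt))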
Alternative-2: trick SQLAlchemy
This basically makes your sample code work by tricking SA into assuming that all those TimeSeries are children of one entity, using Concrete Table Inheritance. The code below is self-contained and creates 50 tables with minimal data in them. But if you have a database already, it should allow you to check the performance rather quickly, so that you can decide whether it is even a realistic possibility.
from datetime import date, datetime

from sqlalchemy import create_engine, Column, String, Integer, DateTime, Float, ForeignKey, func
from sqlalchemy.orm import sessionmaker, relationship, configure_mappers, joinedload
from sqlalchemy.ext.declarative import declarative_base, declared_attr
from sqlalchemy.ext.declarative import AbstractConcreteBase, ConcreteBase

engine = create_engine('sqlite:///:memory:', echo=True)
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base(engine)


# MODEL
class Location(Base):
    __tablename__ = 'locations'
    id = Column(Integer, primary_key=True)
    table_name = Column(String(50), unique=True)
    lon = Column(Float)
    lat = Column(Float)


class TSBase(AbstractConcreteBase, Base):
    @declared_attr
    def table_name(cls):
        return Column(String(50), ForeignKey('locations.table_name'))


def make_timeseries(name):
    class TimeSeries(TSBase):
        __tablename__ = name
        __mapper_args__ = {'polymorphic_identity': name, 'concrete': True}
        datetime = Column(DateTime, primary_key=True)
        value = Column(Float)

        def __init__(self, datetime, value, table_name=name):
            self.table_name = table_name
            self.datetime = datetime
            self.value = value

    return TimeSeries


def _test_model():
    _NUM = 50

    # 0. generate classes for all tables
    TS_list = [make_timeseries('ts{}'.format(1 + i)) for i in range(_NUM)]
    TS1, TS2, TS3 = TS_list[:3]  # just to have some named ones
    Base.metadata.create_all()
    print('-' * 80)

    # 1. configure mappers
    configure_mappers()

    # 2. define relationship
    Location.timeseries = relationship(TSBase, lazy="dynamic")
    print('-' * 80)

    # 3. add some test data
    session.add_all([Location(table_name='ts{}'.format(1 + i), lat=5 + i, lon=1 + i * 2)
                     for i in range(_NUM)])
    session.commit()
    print('-' * 80)

    session.add(TS1(datetime(2001, 1, 1, 3), 999))
    session.add(TS1(datetime(2001, 1, 2, 2), 1))
    session.add(TS2(datetime(2001, 1, 2, 8), 33))
    session.add(TS2(datetime(2002, 1, 2, 18, 50), -555))
    session.add(TS3(datetime(2005, 1, 3, 3, 33), 8))
    session.commit()

    # Query-1: get all timeseries of one Location
    # qs = session.query(Location).first()
    qs = session.query(Location).filter(Location.table_name == "ts1").first()
    print(qs)
    print(qs.timeseries.all())
    assert 2 == len(qs.timeseries.all())
    print('-' * 80)

    # Query-2: select all locations with data between date-A and date-B
    dateA, dateB = date(2001, 1, 1), date(2003, 12, 31)
    qs = (session.query(Location)
          .join(TSBase, Location.timeseries)
          .filter(TSBase.datetime >= dateA)
          .filter(TSBase.datetime <= dateB)
          ).all()
    print(qs)
    assert 2 == len(qs)
    print('-' * 80)

    # Query-3: select all data (including coordinates) for date A
    dateA = date(2001, 1, 1)
    qs = (session.query(Location.lat, Location.lon, TSBase.datetime, TSBase.value)
          .join(TSBase, Location.timeseries)
          .filter(func.date(TSBase.datetime) == dateA)
          ).all()
    print(qs)
    # note: qs is a list of tuples; easy export to CSV
    assert 1 == len(qs)
    print('-' * 80)


if __name__ == '__main__':
    _test_model()
Alternative-3: a-la BigData
If you do get into performance problems using the database, I would probably try the following:
still keep the data in separate tables/databases/schemas like you do right now
bulk-import data using "native" solutions provided by your database engine
use MapReduce-like analysis.
Here I would stay with Python and SQLAlchemy and implement my own distributed query and aggregation (or find something existing). This, obviously, only works if you do not have a requirement to produce those results directly on the database.
edit-1: Alternative-4: TimeSeries databases
I have no experience using those on a large scale, but definitely an option worth considering.
Would be fantastic if you could later share your findings and whole decision-making process on this.
I would avoid the database design you mention above. I don't know enough about the data you are working with, but it sounds like you should have two tables. One table for location, and a child table for location_data. The location table would store the data you mention above such as coordinates and elevations. The location_data table would store the location_id from the location table as well as the time series data you want to track.
This would eliminate the need for database structure and code changes every time you add another location, and would allow the types of queries you are looking at doing. A rough sketch of this two-table layout follows.
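For illustration only (the names, types, and extra columns are assumptions, not the asker's schema), the two-table design described above could look roughly like this in SQLAlchemy:

from sqlalchemy import Column, DateTime, Float, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import backref, relationship

Base = declarative_base()

class Location(Base):
    __tablename__ = 'locations'
    id = Column(Integer, primary_key=True)
    name = Column(String(50), unique=True)
    lon = Column(Float)
    lat = Column(Float)
    elevation = Column(Float)

class LocationData(Base):
    __tablename__ = 'location_data'
    id = Column(Integer, primary_key=True)
    location_id = Column(Integer, ForeignKey('locations.id'), index=True)
    datetime = Column(DateTime, index=True)
    value = Column(Float)

    # One Location has many LocationData rows.
    location = relationship('Location', backref=backref('data', lazy='dynamic'))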
Two parts:
only use two tables
There's no need to have dozens or hundreds of identical tables. Just have a table for location and one for location_data, where every entry has a foreign key onto location. Also create an index on the location_data table for location_id, so you have efficient searching.
don't use sqlalchemy to create this
I love SQLAlchemy and use it every day. It's great for managing your database and adding some rows, but you don't want to use it for an initial load of millions of rows. Instead, generate a file that is compatible with Postgres' COPY statement [http://www.postgresql.org/docs/9.2/static/sql-copy.html]. COPY will let you pull in a ton of data fast; it's what is used during dump/restore operations.
SQLAlchemy will be great for querying this and adding rows as they come in. For bulk operations, you should use COPY; a rough sketch is shown below.
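As a sketch of that approach (an addition, not part of the original answer): assuming a location_data table like the one sketched earlier and a CSV file whose columns match, psycopg2's copy_expert() can drive COPY from Python; the file name, credentials, and column list are placeholders:

import psycopg2

conn = psycopg2.connect('dbname=location_test user=user password=pw host=localhost')
with conn:  # wraps the COPY in a transaction and commits on success
    with conn.cursor() as cur, open('ts1.csv') as f:
        cur.copy_expert(
            "COPY location_data (location_id, datetime, value) "
            "FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )
conn.close()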
I have a peewee model like the following:
class Parrot(Model):
    is_alive = BooleanField()
    bought = DateField()
    color = CharField()
    name = CharField()
    id = IntegerField()
I get this data from the user and look for the corresponding id in the (MySQL) database. What I want to do now is to update those attributes which are not set/empty at the moment. For example, if the new data has the following attributes:
is_alive = True
bought = '1965-03-14'
color = None
name = 'norwegian'
id = 17
and the data from the database has:
is_alive = False
bought = None
color = 'blue'
name = ''
id = 17
I would like to update the bought date and the name (which are not set or empty), but without changing the is_alive status. In this case, I could get the new and old data in separate class instances, manually create a list of attributes, compare them one by one, update where necessary, and finally save the result to the database. However, I feel there might be a better way of handling this, which could also be used for any class with any attributes. Is there?
MySQL Solution:
UPDATE my_table SET
    bought = (CASE WHEN bought IS NULL OR bought = '' THEN ? ELSE bought END)
  , name   = (CASE WHEN name   IS NULL OR name   = '' THEN ? ELSE name   END)
  -- include other field values, if any, here
WHERE
    id = ?
Use your scripting language to set the parameter values.
If the parameters match the old values, the update will not change anything, by default.
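A hedged sketch of driving that statement from Python via peewee's raw-SQL escape hatch; the database handle db, the new values, and the %s placeholder style of the MySQL driver are assumptions, not part of the original answer:

# my_table stands in for the model's actual table name.
sql = """
UPDATE my_table SET
    bought = (CASE WHEN bought IS NULL OR bought = '' THEN %s ELSE bought END),
    name   = (CASE WHEN name   IS NULL OR name   = '' THEN %s ELSE name   END)
WHERE id = %s
"""
db.execute_sql(sql, ('1965-03-14', 'norwegian', 17))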