Adding, removing, and comparing values to column with Python SQLAlchemy

Adding, removing, and comparing values to column with Python SQLAlchemy - python

SQLAlchemy community, a noob in database and specifically sqlalchemy is seeking your help here. As one would expects, my database consists of rows and columns. Each row is the information about one unique person. Each person has multiple columns (date of birth, name, last name, previous log-in dates, etc.) For one of these columns (previous log-in dates), I would like to store multiple values inside a single cell. In other words, I would like to be able to store the last, let's say ten, log-in dates and be able to manipulate these dates the same way that one would manipulate a list in python. I would like to be able to append new log-in dates to this cell, remove items from the cell, and access specific index of the cell. Basically my cell would like something like this
{"04042020","04052020","04072020"}
And my database would look like this
Name | Last Name | Last Log in dates
------------------------------------------------------------
Edgar | Allen | {"04042020","04052020","04072020"}
Dimitri | Albertini | {"12042019","10112019","01072020"}
I know that sqlalchemy has a way of incorporating ARRAYs with
from sqlalchemy.dialects.postgresql import ARRAY
After some efforts, I was just merely able to create an ARRAY and I could NOT figure out a way to manipulate(append, remove, access) the array. Here is a simple prototype that create a table with just one column and the element in the first row equals {1,2,3}.
from sqlalchemy import create_engine
from sqlalchemy import Table, Column, Integer, String, MetaData, ForeignKey
from sqlalchemy.dialects.postgresql import ARRAY
engine = create_engine('postgresql://rouzbeh:tiger#localhost/newtable')
metadata = MetaData()
newtable = Table("newtable", metadata,
Column("data", ARRAY(Integer))
)
metadata.create_all(engine)
connection = engine.connect()
connection.execute(newtable.insert(),data=[1,2,3])
Output in Postgress postico looks like the following screenshot. Again to reiterate, I would like to be able to access the elements ({1,2,3}) and manipulate them by removing or adding elements to it. ({1,2,3,4}) or ({1,2})

For a similar use case I am storing such data as JSON. Check out this SO post. The idea is to dump JSON as TEXT (or a JSON object, if your database is supporting this).
For more complex structures I use Marshmallow. It's especially useful with nested structures.
However, this approach is controversial. See this SO post and in this Quora discussion. If you can live with the downsides, it is an easy way to store data.

Related

SQLAlchemy sqlite3 remove value from JSON column on multiple rows with different JSON values

Say I have an id column that is saved as ids JSON NOT NULL using SQLAlchemy, and now I want to delete an id from this column. I'd like to do several things at once:
query only the rows who have this specific ID
delete this ID from all rows it appears in
a bonus, if possible - delete the row if the ID list is now empty.
For the query, something like this:
db.query(models.X).filter(id in list(models.X.ids)) should work.
now, I'd rather avoid iterating over each query and then send an update request as it can be multiple rows. Is there any elegant way to do this?
Thanks!

For the search and remove remove part you can use json_remove function (from SQLLite built-in functions)
from sqlalchemy import func
db.query(models.X).update({'ids': func.json_remove(models.X.ids,f'$[{TARGET_ID}]') })
Here replace TARGET_ID by the targeted id.
Now this will update the row 'silently' (wether or not this id is present in the array).
If you want to first check if target id is in the column: you can query first all rows containing the target id with json_extract query (calling .all() method and then remove those ids with an .update() call.
But this will cost you double amount of queries (less performant).
For the delete part, you can use the json_array_length built-in function
from sqlalchemy import func
db.query(models.X).filter(func.json_array_length(models.X.ids) == 0).delete()
FYI : Not sure that you can do both in one query, and even if possible, I would not do it for clean syntax, logging and monitoring reasons.

sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "product_pkey" [duplicate]

I have a tabled called products
which has following columns
id, product_id, data, activity_id
What I am essentially trying to do is copy bulk of existing products and update it's activity_id and create new entry in the products table.
Example:
I already have 70 existing entries in products with activity_id 2
Now I want to create another 70 entries with same data except for updated activity_id
I could have thousands of existing entries that I'd like to make a copy of and update the copied entries activity_id to be a new id.
products = self.session.query(model.Products).filter(filter1, filter2).all()
This returns all the existing products for a filter.
Then I iterate through products, then simply clone existing products and just update activity_id field.
for product in products:
product.activity_id = new_id
self.uow.skus.bulk_save_objects(simulation_skus)
self.uow.flush()
self.uow.commit()
What is the best/ fastest way to do these bulk entries so it kills time, as of now it's OK performance, is there a better solution?

You don't need to load these objects locally, all you really want to do is have the database create these rows.
You essentially want to run a query that creates the rows from the existing rows:
INSERT INTO product (product_id, data, activity_id)
SELECT product_id, data, 2 -- the new activity_id value
FROM product
WHERE activity_id = old_id
The above query would run entirely on the database server; this is far preferable over loading your query into Python objects, then sending all the Python data back to the server to populate INSERT statements for each new row.
Queries like that are something you could do with SQLAlchemy core, the half of the API that deals with generating SQL statements. However, you can use a query built from a declarative ORM model as a starting point. You'd need to
Access the Table instance for the model, as that then lets you create an INSERT statement via the Table.insert() method.
You could also get the same object from models.Product query, more on that later.
Access the statement that would normally fetch the data for your Python instances for your filtered models.Product query; you can do so via the Query.statement property.
Update the statement to replace the included activity_id column with your new value, and remove the primary key (I'm assuming that you have an auto-incrementing primary key column).
Apply that updated statement to the Insert object for the table via Insert.from_select().
Execute the generated INSERT INTO ... FROM ... query.
Step 1 can be achieved by using the SQLAlchemy introspection API; the inspect() function, applied to a model class, gives you a Mapper instance, which in turn has a Mapper.local_table attribute.
Steps 2 and 3 require a little juggling with the Select.with_only_columns() method to produce a new SELECT statement where we swapped out the column. You can't easily remove a column from a select statement but we can, however, use a loop over the existing columns in the query to 'copy' them across to the new SELECT, and at the same time make our replacement.
Step 4 is then straightforward, Insert.from_select() needs to have the columns that are inserted and the SELECT query. We have both as the SELECT object we have gives us its columns too.
Here is the code for generating your INSERT; the **replace keyword arguments are the columns you want to replace when inserting:
from sqlalchemy import inspect, literal
from sqlalchemy.sql import ClauseElement
def insert_from_query(model, query, **replace):
# The SQLAlchemy core definition of the table
table = inspect(model).local_table
# and the underlying core select statement to source new rows from
select = query.statement
# validate asssumptions: make sure the query produces rows from the above table
assert table in select.froms, f"{query!r} must produce rows from {model!r}"
assert all(c.name in select.columns for c in table.columns), f"{query!r} must include all {model!r} columns"
# updated select, replacing the indicated columns
as_clause = lambda v: literal(v) if not isinstance(v, ClauseElement) else v
replacements = {name: as_clause(value).label(name) for name, value in replace.items()}
from_select = select.with_only_columns([
replacements.get(c.name, c)
for c in table.columns
if not c.primary_key
])
return table.insert().from_select(from_select.columns, from_select)
I included a few assertions about the model and query relationship, and the code accepts arbitrary column clauses as replacements, not just literal values. You could use func.max(models.Product.activity_id) + 1 as a replacement value (wrapped as a subselect), for example.
The above function executes steps 1-4, producing the desired INSERT SQL statement when printed (I created a products model and query that I thought might be representative):
>>> print(insert_from_query(models.Product, products, activity_id=2))
INSERT INTO products (product_id, data, activity_id) SELECT products.product_id, products.data, :param_1 AS activity_id
FROM products
WHERE products.activity_id != :activity_id_1
All you have to do is execute it:
insert_stmt = insert_from_query(models.Product, products, activity_id=2)
self.session.execute(insert_stmt)

How can I put the the list type of item into database?

I use Scrapy to write a spider to get something from a website.And I want to put the item into database.In my code,there are five items ,two of items are unicode type,so I can put it into database directily,but two of items are list type,How can I put it into database?Here is my code about the items whose type are list:
descr = sel.xpath(
'//*[#id="root"]/div/main/div/div[2]/div[1]/div[2]/div/div/div/div[2]/div/div/span/p[1]/text()').extract()
print 'type is:', type(descr)
answer_time = sel.xpath(
'//*[#id="root"]/div/main/div/div[2]/div[1]/div[2]/div/div/div/div[2]/div/div/div/a/span/#data-tooltip').extract()
print 'type is:', type(answer_time)

It should depend strongly on how the data will be used once in the db and what kind of db you're using.
If you're using something like MongoDB, it fully supports just adding the list as part of the record. Some relational dbs such as postgresql have support for JSON column types.
Beyond that there are two main options. 1) Cast the list to a string using something like JSON and save it in a text column along with your other unicode columns. 2) Take full advantage of a relational db and use a second table to store a one:many relationship.
Casting it a string is much easier/faster to dev. Great for rapid prototyping. However, using multiple tables has huge efficiency perks if the data is going to be read for analysis on any data set that is not small.

Get All of Single Column from Every Table in Schema

In our system, we have 1000+ tables, each of which has an 'date' column containing DateTime object. I want to get a list containing every date that exists within all of the tables. I'm sure there should be an easy way to do this, but I've very limited knowledge of either postgresql or sqlalchemy.
In postgresql, I can do a full join on two tables, but there doesn't seem to be a way to do a join on every table in a schema, for a single common field.
I then tried to solve this programmatically in python with sqlalchemy. For each table, I did created a select distinct for the 'date' column, then set that list of selectes that to the selects property of a CompoundSelect object, and executed. As one might expect from an ugly brute force query, it has ben running now for an hour or so, and I am unsure if it has broken silently somewhere and will never return.
Is there a clean and better way to do this?

You definitely want to do this on the server, not at the application level, due to the many round trips between application and server and likely duplication of data in intermediate results.
Since you need to process 1,000+ tables, you should use the system catalogs and dynamically query the tables. You need a function to do that efficiently:
CREATE FUNCTION get_all_dates() RETURNS SETOF date AS $$
DECLARE
tbl name;
BEGIN
FOR tbl IN SELECT 'public.' || tablename FROM pg_tables WHERE schemaname = 'public' LOOP
RETURN QUERY EXECUTE 'SELECT DISTINCT date::date FROM ' || tbl;
END LOOP
END; $$ LANGUAGE plpgsql;
This will process all the tables in the public schema; change as required. If the tables are in multiple schemas you need to insert your additional logic on where tables are stored, or you can make the schema name a parameter of the function and call the function multiple times and UNION the results.
Note that you may get duplicate dates from multiple tables. These duplicates you can weed out in the statement calling the function:
SELECT DISTINCT * FROM get_all_dates() ORDER BY 1;
The function creates a result set in memory, but if the number of distinct dates in the rows in the 1,000+ tables is very large, the results will be written to disk. If you expect this to happen, then you are probably better off creating a temporary table at the beginning of the function and inserting the dates into that temp table.

Ended up reverting back to a previous solution of using SqlAlchemy to run the queries. This allowed me to parallelize things and run a little faster, since it really was a very large query.
I knew a few things with the dataset that helped with this query- I only wanted distinct dates from each table, and that the dates were the PK in my set. I ended up using the approach from this wiki page. Code being sent in the query looked like the following:
WITH RECURSIVE t AS (
(SELECT date FROM schema.tablename ORDER BY date LIMIT 1)
UNION ALL SELECT (SELECT knowledge_date FROM schema.table WHERE date > t.date ORDER BY date LIMIT 1)
FROM t WHERE t.date IS NOT NULL)
SELECT date FROM t WHERE date IS NOT NULL;
I pulled the results of that query into a list of all my dates if they weren't already in the list, then saved that for use later. It's possible that it takes just as long as running it all in the pgsql console, but it was easier for me to save locally than to have to query the temp table in the db.

Automatically merging arbitrary Django models

I have two Django-ORM managed databases that I'd like to merge. Both have a very similar schema, and both have the standard auth_users table, along with a few other shared tables that reference each other as well as auth_users, which I'd like to merge into a single database automatically.
Understandably, this could be very non-trivial depending upon the foreign-key relationships, and what constitutes a "unique" record in each table.
Does anyone know if there exists a tool to do this merge operation?
If nothing like this currently exists, I was considering writing my own management command, based on the standard loaddata command. Essentially, you'd use the standard dumpdata command to export tables from a source database, and then use a modified version of loaddata to "merge" them into the destination database.
For example, if I have databases A and B, and I want to merge database B into database A, then I'd want to follow a procedure according to the pseudo-code:
merge_database_dst = A
merge_database_src = B
for table in sorted(merge_database_dst.get_redundant_tables(merge_database_src), key=acyclic_dependency):
key = table.get_unique_column_key()
src_id_to_dst_id = {}
for record_src in merge_database_src.table.objects.all():
src_key_value = record_src.get_key_value(key)
try:
record_dst = merge_database_dst.table.objects.get(key)
dst_key_value = record_dst.get_key_value(key)
except merge_database_dst.table.DoesNotExist:
record_dst = merge_database_dst.table(**[(k,convert_fk(v)) for k,v in record_src._meta.fields])
record_dst.save()
dst_key_value = record_dst.get_key_value(key)
src_id_to_dst_id[(table,record_src.id)] = record_dst.id
The convert_fk() function would use the src_id_to_dst_id index to convert foreign key references in the source table to the equivalent IDs in the destination table.
To summarize, the algorithm would iterate over the table to be merged in the order of dependency, with parents iterated over first. So if we wanted to merge tables auth_users and mycustomprofile, which is dependent on auth_users, we'd iterate ['auth_users','mycustomprofile'].
Each merged table would need some sort of indicator documenting the combination of columns that denotes a universally unique record (i.e. the "key"). For auth_users, that might be the "username" and/or "email" column.
If the value of the key in database B already exists in A, then the record is not imported from B, but the ID of the existing record in A is recorded.
If the value of the key in database B does not exist in A, then the record is imported from B, and the ID of the new record is recorded.
Using the previously recorded ID, a mapping is created, explaining how to map foreign-key references to that specific record in B to the new merged/pre-existing record in A. When future records are merged into A, this mapping would be used to convert the foreign keys.
I could still envision some cases where an imported record references a table not included in the dumpdata, which might cause the entire import to fail, therefore some sort of "dryrun" option would be needed to simulate the import to ensure all FK references can be translated.
Does this seem like a practical approach? Is there a better way?
EDIT: This isn't exactly what I'm looking for, but I thought others might find it interesting. The Turbion project has a mechanism for copying changes between equivalent records in different Django models within the same database. It works by defining a translation layer (i.e. merging.ModelLayer) between two Django models, so, say if you update the "www" field in user bob#bob.com's profile, it'll automatically update the "url" field in user bob#bob.com's otherprofile.
The functionality I'm looking for is a bit different, in that I want to merge an entire (or partial) database snapshot at infrequent intervals, sort of the way the loaddata management command does.

Wow. This is going to be a complex job regardless. That said:
If I understand the needs of your project correctly, this can be something that can be done using a data migration in South. Even so, I'd be lying if I said it was going to be a joke.
My recommendation is -- and this is mostly a parrot of an assumption in your question, but I want to make it clear -- that you have one "master" table that is the base, and which has records from the other table added to it. So, table A keeps all of its existing records, and only gets additions from B. B feeds additions into A, and once done, B is deleted.
I'm hesitant to write you sample code because your actual job will be so much more complex than this, but I will anyway to try and point you in the right direction. Consider something like...
import datetime
from south.db import db
from south.v2 import DataMigration
from django.db import models
class Migration(DataMigration):
def forwards(self, orm):
for b in orm.B.objects.all():
# sanity check: does this item get copied into A at all?
if orm.A.objects.filter(username=b.username):
continue
# make an A record with the properties of my B record
a = orm.A(
first_name=b.first_name,
last_name=b.last_name,
email_address=b.email_address,
[...]
)
# save the new A record, and delete the B record
a.save()
b.delete()
def backwards(self, orm):
# backwards method, if you write one
This would end up migrating all of the Bs not in A to A, and leave you a table of Bs that are expected duplicates, which you could then check by some other means before deleting.
Like I said, this sample isn't meant to be complete. If you decide to go this route, spend time in the South documentation, and particularly make sure you look at data migrations.
That's my 2¢. Hope it helps.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.