Python - insert nested dict to sqlite - performance issues - python

I am trying to extract data from a dictionary where performance is very prioritized.
I have ~1 million dicts to process (but this is going to scale up). However the dict can on occasion miss value.
The current solution is based upon the results below but I currently iterate the list to get the input, then pop course, insert it into an unique list if it does not exist (to remove redundancy) and then insert the id back to input with name of "course_id".
Problem:
This is far from efficient but out of ideas or keywords for an better approach. Does Sqlite have any tricks to insert into different tables or can I separate the items in a better way?
My idea:
id is a user_id and the course is an row that should be placed within another table. This data should be inserted to an sqlite database but not sure how to handle the conversion.
Input:
'id':user_id,
'course':{
'id':course.get('id'),
'name':course.get('name')
},
'interests':interests.get('id'),
'address':address.get('name'),
Expected results into user table:
'id':123,
'course_id':1, # ForeignKey to course id given above.
'interests':'Likes to code',
'address':'Parents house',
Expected results into course table:
'id':1,
'name':'some course'
I am currently doing a bulk insert using SQLAlchemy core where I create a list of all items to be inserted and then call execute. Example query:
text('INSERT OR IGNORE INTO users (id, course_id, interests, address) VALUES (:id, :course_id, :interests, :address)'),

Related

Insert record into PostgreSQL table and expire old record

I have a set of data that gets updated periodically by a client. Once a month or so we will download a new set of this data. The dataset is about 50k records with a couple hundred columns of data.
I am trying to create a database that houses all of this data so we can run our own analysis on it. I'm using PostgreSQL and Python (psycopg2).
Occasionally, the client will add columns to the dataset, so there are a number of steps I want to take:
Add new records to the database table
Compare the old set of data with the new set of data and update the table where necessary
Keep the old records, and either add an "expired" flag, or an "db_expire_date" to keep track of whether a record is active or expired
Add any new columns of data to the database for all records
I know how to add new records to the database (1) using INSERT INTO, and how to add new columns of data to the database (4) using ALTER TABLE. But having issues with (2) and (3). I figured out how to update a record, using the following code:
rows = zip(*[update_records[col] for col in update_records])
cursor = conn.cursor()
cursor.execute("""CREATE TEMP TABLE temptable (""" + schema_list + """) ON COMMIT DROP""")
cursor.executemany("""INSERT INTO temptable (""" + var +""") VALUES ("""+ perc_s + """)""", rows)
cursor.execute("""
UPDATE tracking.test_table
SET mfg = temptable.mfg, db_updt_dt = CURRENT_TIMESTAMP
FROM temptable
WHERE temptable.app_id = tracking.test_table.app_id;
""");
cursor.rowcount
conn.commit()
cursor.close()
conn.close()
However, this just updated the record based on the app_id as the primary key.
What I'd like to figure out is how to keep the original record and set it as "expired" and then create a new, updated record. It seems that "app_id" shouldn't be my primary key, so i've created a new primary key as '"primary_key" INT GENERATED ALWAYS AS IDENTITY not null,'.
I'm just not sure where to go from here. I think that I could probably just use INSERT INTO to send the new records to the database. But i'm not sure how to "expire" the old records that way. Possibly I could use UPDATE table to set the older values to "expired". But I am wondering if there is a more straightforward way to do this.
I hope my question is clear. I'm hoping someone can point me in the right direction. Thanks
A pretty standard data warehousing technique is to define two additional date fields, a from-effective-date and a to-effective-date. You only append rows, never update. You add the candidate record if the source primary key does not exist in your table OR if any column value is different from the most recently added prior record in your table with the same primary key. (Each record supersedes the last).
As you add your record to the table you do 3 things:
The New record's from-effective-date gets the transaction file's date
The New record's to-effective-date gets a date WAY in the future, like 9999-12-31. The important thing here is that it will not expire until you say so.
The most recent prior record (the one you compared values for changes) has its to-effective-date Updated to the transaction file's date minus one day. This has the effect of expiring the old record.
This creates a chain of records with the same source primary key with each one covering a non-overlapping time period. This format is surprisingly easy to select from:
If you want to reproduce the most current transaction file you select Where to-effective-date > Current Date
If you want to reproduce the transaction file at any date for a report, you select Where myreportdate Between from-effective-date And to-effective-date.
If you want the entire update history for a key you select * Where the key = mykeyvalue Order By from-effective-date.
The only thing that is ugly about this scheme is when columns are added, the comparison test also must be altered to include those new columns in case something changes. If you want that to be dynamic, you're going to have to loop through the reflection meta data for each column in the table, but Python will need to know how comparing a text field might be different from comparing a BLOB, for example.
If you actually care about having a primary key (many data warehouses do not have primary keys) you can define a compound key on the source primary key + one of those effective dates, it doesn't really matter which one.
You're looking for the concept of a "natural key", which is how you would identify a unique row, regardless of what the explicit logical constraints on the table are.
This means that you're spot on that you need to change your primary key to be more inclusive. Your new primary key doesn't actually help you decipher which row you are looking for once you have both in there unless you already know which row you are looking for (that "identity" field).
I can think of two likely candidates to add to your natural key: date, or batch.
Either way, you would look for "App = X, [Date|batch] = Y" in the data to find that one. Batch would be upload 1, upload 2, etc. You just make it up, or derive it from the date, or something along those lines.
If you aren't sure which to add, and you aren't ever going to upload multiple times in one day, I would go with Date. That will give you more visibility over time, as you can see when and how often things change.
Once you have a natural key, you want to make it explicit in your data. You can either keep your identity column (see: Surrogate Key) or you can have a compound primary key. With no other input or constraints, I would go with a compound primary key for your situation.
I'm a MySQL DBA, so I'm cribbing a bit from the docs here: https://www.postgresqltutorial.com/postgresql-primary-key/
You do NOT want this:
CREATE TABLE test_table (
app_id INTEGER PRIMARY KEY,
date DATE,
active BOOLEAN
);
Instead, you want this:
CREATE TABLE test_table (
app_id INTEGER,
date DATE,
active BOOLEAN,
PRIMARY KEY (app_id, date)
);
I've added an active column here as well, since you wanted to deactivate rows. This isn't explicitly necessary from what you've described though - you can always assume the most recent upload is active. Or you can expand the columns to have a "active_start" date and an "active_end" date, which will enable another set of queries. But for what you've stated here so far, just the date column should suffice. :)
For step 2)
First, you have to identify the records that have the same data for this you can run a select query with where clause before inserting any recode and count the number of records you receive as output. If the count is more than 0 don't insert the recode otherwise you can insert the recode.
For step 3)
For this, you can insert a column as you mention above with the name 'db_expire_date' and insert the expiration value at the time of record insertion only.
You can also use a column like 'is_expire' but for that, you need to add a cron job that can update the DB periodically for the value of this column.

How do I increase the speed of a bulk UPSERT in postgreSQL?

I am trying to load many millions of data records, from multiple distinct sources, to a postgresql table with the following design:
CREATE TABLE public.variant_fact (
variant_id bigint NOT NULL,
ref_allele text NOT NULL,
allele text NOT NULL,
variant_name text NOT NULL,
start bigint,
stop bigint,
variant_attributes jsonb
);
ALTER TABLE public.variant_fact
ADD CONSTRAINT variant_fact_unique UNIQUE (variant_name, start, stop, allele, ref_allele)
INCLUDE (ref_allele, allele, variant_name, start, stop);
Where "start" and "stop" are foreign keys and "variant_id" is an auto-incrementing primary key. I am running into issues with the loading speed because in order to perform the UPSERT, I need to check the table to see whether an element exists for each element I upload. I am performing the operation in python using psycopg2 using the execute_values method.
insert_query = """
INSERT INTO variant_fact AS v (variant_id, ref_allele, allele, variant_name, start, stop, variant_attributes)
VALUES %s
ON CONFLICT ON CONSTRAINT variant_fact_unique DO UPDATE
SET variant_attributes = excluded.variant_attributes || v.variant_attributes
RETURNING variant_id;
"""
inserted = psycopg2.extras.execute_values(cur=cursor, sql=sql, argslist=argslist, template=None, page_size=50000, fetch=fetch)
In my case, argslist is a list of tuples to insert to the database. I have tried to milk this python script for speed, but this UPSERT block is not very performant. Outside of a different schema (perhaps without atomic element records), are there any ways to boost performance for upload? I have already turned off WAL for the table and removed the foreign key constraints for "start" and "stop". Am I missing anything obvious here?
Sorting arglist by "variant_name" and "start" (the first two columns in the index) should make sure that most of the index lookups will be hitting already cached pages. Having the table also be clustered on that index would help make sure the table pages are also accessed in a cache friendly way (although it won't stay clustered very well in the face of new data).
Also, your index is gratuitously double the size it needs to be. There is no point in doing INCLUDE on a column that is already part of the main part of the index. That is going to cost you CPU and IO to format and write the data (and the WAL) and also reduce the amount of data which fits in cache.
Turning off WAL (setting the table UNLOGGED) means that the table will be empty after a crash, because it cannot be recovered. If you are considering running ALTER TABLE later to change it to a LOGGED table, know that this operation will dump the whole table into WAL, so you won't win anything.
For a simple statement like that on an unlogged table, the only way to speed it up are:
drop all indexes, triggers and constraints except variant_fact_unique – but creating them again will be expensive, so you might not win overall
make sure you have fast storage and enough RAM

Python, SQL: How to update multiple rows and columns in a single trip around the database?

Hello StackEx community.
I am implementing a relational database using SQLite interfaced with Python. My table consists of 5 attributes with around a million tuples.
To avoid large number of database queries, I wish to execute a single query that updates 2 attributes of multiple tuples. These updated values depend on the tuples' Primary Key value and so, are different for each tuple.
I am trying something like the following in Python 2.7:
stmt= 'UPDATE Users SET Userid (?,?), Neighbours (?,?) WHERE Username IN (?,?)'
cursor.execute(stmt, [(_id1, _Ngbr1, _name1), (_id2, _Ngbr2, _name2)])
In other words, I am trying to update the rows that have Primary Keys _name1 and _name2 by substituting the Neighbours and Userid columns with corresponding values. The execution of the two statements returns the following error:
OperationalError: near "(": syntax error
I am reluctant to use executemany() because I want to reduce the number of trips across the database.
I am struggling with this issue for a couple of hours now but couldn't figure out either the error or an alternate on the web. Please help.
Thanks in advance.
If the column that is used to look up the row to update is properly indexed, then executing multiple UPDATE statements would be likely to be more efficient than a single statement, because in the latter case the database would probably need to scan all rows.
Anyway, if you really want to do this, you can use CASE expressions (and explicitly numbered parameters, to avoid duplicates):
UPDATE Users
SET Userid = CASE Username
WHEN ?5 THEN ?1
WHEN ?6 THEN ?2
END,
Neighbours = CASE Username
WHEN ?5 THEN ?3
WHEN ?6 THEN ?4
END,
WHERE Username IN (?5, ?6);

Effective batch "update-or-insert" in SqlAlchemy

There exists a table Users and in my code I have a big list of User objects. To insert them I can use :
session.add_all(user_list)
session.commit()
The problem is that there can be several duplicates which I want to update but the database wont allow to insert duplicate entries. For sure, I can iterate over user_list and try to insert user in the database and if it fails - update it :
for u in users:
q = session.query(T).filter(T.fullname==u.fullname).first()
if q:
session.query(T).filter_by(index=q.index).update({column: getattr(u,column) for column in Users.__table__.columns.keys() if column!='id'})
session.commit()
else:
session.add(u)
session.commit()
but I find this solution quiet ineffective : first, I am making several requests to retrieve object q, and instead of batch inserting of new items I insert them one per one. I wonder if there exists a better solution for this task.
UPD better version:
for u in users:
q = session.query(T).filter(Users.fullname==u.fullname).first()
if q:
for column in Users.__table__.columns.keys():
if not column=='index':
setattr(q,column,getattr(u,column))
session.add(q)
else:
session.add(u)
session.commit()
a better solution would be to use
INSERT ... ON DUPLICATE KEY UPDATE ...
bulk MySQL construct (I assume you're using MySQL because your post is tagged with 'mysql'). This way you're both inserting new entries and updating existing ones in one statement / transaction, see http://dev.mysql.com/doc/refman/5.6/en/insert-on-duplicate.html
It's not ideal if you have multiple unique indexes and, depending on your schema, you'll have to fill in all NOT NULL values (hence issuing one bulk SELECT before calling it), but it's definitely the most efficient option and we use it a lot. The bulk version will look something like (let's assume name is a unique key):
INSERT INTO User (name, phone, ...) VALUES
('ksmith', '111-11-11', ...),
('jford', '222-22,22', ...),
...,
ON DUPLICATE KEY UPDATE
phone = VALUES(phone),
... ;
Unfortunately, INSERT ... ON DUPLICATE KEY UPDATE ... is not supported natively by SQLa so you'll have to implement a little helper function which will build the query for you.

Get All of Single Column from Every Table in Schema

In our system, we have 1000+ tables, each of which has an 'date' column containing DateTime object. I want to get a list containing every date that exists within all of the tables. I'm sure there should be an easy way to do this, but I've very limited knowledge of either postgresql or sqlalchemy.
In postgresql, I can do a full join on two tables, but there doesn't seem to be a way to do a join on every table in a schema, for a single common field.
I then tried to solve this programmatically in python with sqlalchemy. For each table, I did created a select distinct for the 'date' column, then set that list of selectes that to the selects property of a CompoundSelect object, and executed. As one might expect from an ugly brute force query, it has ben running now for an hour or so, and I am unsure if it has broken silently somewhere and will never return.
Is there a clean and better way to do this?
You definitely want to do this on the server, not at the application level, due to the many round trips between application and server and likely duplication of data in intermediate results.
Since you need to process 1,000+ tables, you should use the system catalogs and dynamically query the tables. You need a function to do that efficiently:
CREATE FUNCTION get_all_dates() RETURNS SETOF date AS $$
DECLARE
tbl name;
BEGIN
FOR tbl IN SELECT 'public.' || tablename FROM pg_tables WHERE schemaname = 'public' LOOP
RETURN QUERY EXECUTE 'SELECT DISTINCT date::date FROM ' || tbl;
END LOOP
END; $$ LANGUAGE plpgsql;
This will process all the tables in the public schema; change as required. If the tables are in multiple schemas you need to insert your additional logic on where tables are stored, or you can make the schema name a parameter of the function and call the function multiple times and UNION the results.
Note that you may get duplicate dates from multiple tables. These duplicates you can weed out in the statement calling the function:
SELECT DISTINCT * FROM get_all_dates() ORDER BY 1;
The function creates a result set in memory, but if the number of distinct dates in the rows in the 1,000+ tables is very large, the results will be written to disk. If you expect this to happen, then you are probably better off creating a temporary table at the beginning of the function and inserting the dates into that temp table.
Ended up reverting back to a previous solution of using SqlAlchemy to run the queries. This allowed me to parallelize things and run a little faster, since it really was a very large query.
I knew a few things with the dataset that helped with this query- I only wanted distinct dates from each table, and that the dates were the PK in my set. I ended up using the approach from this wiki page. Code being sent in the query looked like the following:
WITH RECURSIVE t AS (
(SELECT date FROM schema.tablename ORDER BY date LIMIT 1)
UNION ALL SELECT (SELECT knowledge_date FROM schema.table WHERE date > t.date ORDER BY date LIMIT 1)
FROM t WHERE t.date IS NOT NULL)
SELECT date FROM t WHERE date IS NOT NULL;
I pulled the results of that query into a list of all my dates if they weren't already in the list, then saved that for use later. It's possible that it takes just as long as running it all in the pgsql console, but it was easier for me to save locally than to have to query the temp table in the db.

Categories

Resources