Many to Many Relationships with Postgres COPY Command - python

Is it possible to include many-to-many relationships when running a Postgres COPY command? If so, can you give me an example?
For example:
CREATE TABLE "lap" (
"id" serial NOT NULL PRIMARY KEY,
"Lap_number" integer,
"Lap_time" interval,
)
;
CREATE TABLE "datasinglerace_Laps" (
"id" serial NOT NULL PRIMARY KEY,
"datasinglerace_id" integer NOT NULL,
"lap_id" integer NOT NULL REFERENCES "lap" ("id") DEFERRABLE INITIALLY DEFERRED,
UNIQUE ("datasinglerace_id", "lap_id")
)
;
CREATE TABLE "datasinglerace" (
"id" serial NOT NULL PRIMARY KEY,
"Notes" text,
)
;
ALTER TABLE "datasinglerace_Laps" ADD CONSTRAINT "datasinglerace_id_refs_id_620382df" FOREIGN KEY ("datasinglerace_id")
REFERENCES "datasinglerace" ("id") DEFERRABLE INITIALLY DEFERRED;
The lap objects are already in the database. For the COPY file, I'd like to provide the datasinglerace ids together with the list of lap ids I want to attach; there will be a variable number of lap objects per datasinglerace.
This SQL was generated by the Django framework. I want to stay within Django, so I don't want to change the SQL. Importing the data has been really slow, so I'm working on improving the import speed.

You can use COPY to improve the speed of importing batches of data as a one-off - I wouldn't use it routinely - but have you isolated the bottleneck? What are you copying from? You would need something like CSV files with the same structure as the tables.
COPY goes directly to PostgreSQL and has nothing to do with Django, so you will need to interact with Postgres using psql or whatever tool you use. Your commands will look like this:
COPY datasinglerace FROM 'datasinglerace.csv' WITH (FORMAT csv);
COPY "datasinglerace_Laps" FROM 'datasinglerace_Laps.csv' WITH (FORMAT csv);
COPY has many options; see the documentation.
Note that anything referenced by something else needs to be added first; otherwise you would need to relax (drop and then re-create) your reference constraints. In this case you would COPY into datasinglerace_Laps last, so that the rows it references already exist.
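To make the many-to-many part concrete, here is a rough sketch of how the join rows could be streamed into COPY from Python with psycopg2. The race_laps dict, the connection string and the buffering are made up for illustration; only the table and column names come from the schema above. The serial id column is left out of the column list so Postgres fills it in itself.

import io
import psycopg2

# Hypothetical input: each datasinglerace id maps to a variable number of lap ids.
race_laps = {
    1: [10, 11, 12],
    2: [13, 14],
}

conn = psycopg2.connect("dbname=racing")  # placeholder connection string
with conn, conn.cursor() as cur:
    # Flatten the many-to-many data into one CSV row per (race, lap) pair.
    buf = io.StringIO()
    for race_id, lap_ids in race_laps.items():
        for lap_id in lap_ids:
            buf.write(f"{race_id},{lap_id}\n")
    buf.seek(0)
    cur.copy_expert(
        'COPY "datasinglerace_Laps" (datasinglerace_id, lap_id) FROM STDIN WITH (FORMAT csv)',
        buf,
    )

Streaming through STDIN also avoids having to place the file on the database server, which is what a plain COPY ... FROM 'file.csv' requires.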

Related

How do I increase the speed of a bulk UPSERT in PostgreSQL?

I am trying to load many millions of data records, from multiple distinct sources, into a PostgreSQL table with the following design:
CREATE TABLE public.variant_fact (
variant_id bigint NOT NULL,
ref_allele text NOT NULL,
allele text NOT NULL,
variant_name text NOT NULL,
start bigint,
stop bigint,
variant_attributes jsonb
);
ALTER TABLE public.variant_fact
ADD CONSTRAINT variant_fact_unique UNIQUE (variant_name, start, stop, allele, ref_allele)
INCLUDE (ref_allele, allele, variant_name, start, stop);
Where "start" and "stop" are foreign keys and "variant_id" is an auto-incrementing primary key. I am running into issues with the loading speed because in order to perform the UPSERT, I need to check the table to see whether an element exists for each element I upload. I am performing the operation in python using psycopg2 using the execute_values method.
insert_query = """
INSERT INTO variant_fact AS v (variant_id, ref_allele, allele, variant_name, start, stop, variant_attributes)
VALUES %s
ON CONFLICT ON CONSTRAINT variant_fact_unique DO UPDATE
SET variant_attributes = excluded.variant_attributes || v.variant_attributes
RETURNING variant_id;
"""
inserted = psycopg2.extras.execute_values(cur=cursor, sql=insert_query, argslist=argslist, template=None, page_size=50000, fetch=True)
In my case, argslist is a list of tuples to insert to the database. I have tried to milk this python script for speed, but this UPSERT block is not very performant. Outside of a different schema (perhaps without atomic element records), are there any ways to boost performance for upload? I have already turned off WAL for the table and removed the foreign key constraints for "start" and "stop". Am I missing anything obvious here?
Sorting argslist by "variant_name" and "start" (the first two columns in the index) should make sure that most of the index lookups will be hitting already-cached pages (see the sketch at the end of this answer). Having the table also be clustered on that index would help make sure the table pages are also accessed in a cache-friendly way (although it won't stay clustered very well in the face of new data).
Also, your index is gratuitously double the size it needs to be. There is no point in doing INCLUDE on a column that is already part of the main part of the index. That is going to cost you CPU and IO to format and write the data (and the WAL) and also reduce the amount of data which fits in cache.
Turning off WAL (setting the table UNLOGGED) means that the table will be empty after a crash, because it cannot be recovered. If you are considering running ALTER TABLE later to change it to a LOGGED table, know that this operation will dump the whole table into WAL, so you won't win anything.
For a simple statement like that on an unlogged table, the only ways to speed it up are:
drop all indexes, triggers and constraints except variant_fact_unique – but creating them again will be expensive, so you might not win overall
make sure you have fast storage and enough RAM
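As a concrete illustration of the sorting suggestion, a minimal sketch; it assumes the cursor and argslist from the question and that each tuple follows the column order of the INSERT statement (variant_name at index 3, start at index 4):

from psycopg2 import extras

# Sort by the leading columns of variant_fact_unique (variant_name, start) so
# consecutive index lookups hit pages that are already cached; NULL starts sort first.
argslist.sort(key=lambda row: (row[3], -1 if row[4] is None else row[4]))

inserted = extras.execute_values(
    cur=cursor,
    sql=insert_query,
    argslist=argslist,
    template=None,
    page_size=50000,
    fetch=True,  # collect the RETURNING variant_id values
)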

Python/Django: Key already exists (Postgres)

I have a project built in Django and it uses a Postgres database.
This database was populated from CSV files, so when I try to insert a new object I get a "duplicate key" error because an object with id = 1 already exists.
The code :
user = User(name= "Foo")
user.save()
The table users has the PK on the id.
Indexes:
"users_pkey" PRIMARY KEY, btree (id)
If I get the table's details in psql I got:
 Column |  Type   | Modifiers
--------+---------+----------------------------------------------------
 id     | integer | not null default nextval('users_id_seq'::regclass)
Additionally, if I inspect user.__dict__ after creating the variable user and before saving it, I get 'id': None.
How can I save the user with an id that is not being used?
You most likely inserted your users from the CSV with the id value set explicitly. When this happens, the Postgres sequence is not updated, so when you try to add a new user the sequence generates a value that is already in use.
Check this other question for reference: postgres autoincrement not updated on explicit id inserts.
The solution is what the answer to that question says: update your sequence manually.
You can fix it by setting users_id_seq manually.
SELECT setval('users_id_seq', (SELECT MAX(id) from "users"));
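If you would rather do that from inside Django than in psql, a minimal sketch (table and sequence names are the ones from the question; adjust them to whatever your model actually uses):

from django.db import connection

# Push the sequence up to the current maximum id so the next INSERT
# gets a value that is not already taken.
cursor = connection.cursor()
cursor.execute("SELECT setval('users_id_seq', (SELECT MAX(id) FROM users))")

Django also ships a management command, python manage.py sqlsequencereset <app_label>, that prints equivalent statements for every table of an app.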
Unless you have name as the primary key for the table, the above insert should work. If you do have name as the primary key, remove it and try again.
In Postgres you can declare id as serial and mark it as the primary key. Then whenever you insert a record, it takes the next value in the sequence.
i.e. id serial NOT NULL and
CONSTRAINT primkey PRIMARY KEY (id).
As you said the table was pre-populated from CSV; once the sequence is in place, inserts from Python code will automatically go to the end of the table and there will be no duplicate values.

SQLite insert not working with Python

I'm working with sqlite3 on Python 2.7 and I am facing a problem with a many-to-many relationship. I have a table from which I am fetching its primary key like this:
current.execute("SELECT ExtensionID FROM tblExtensionLookup where ExtensionName = ?",[ext])
and then I am fetching another primary key from another table:
current.execute("SELECT HostID FROM tblHostLookup where HostName = ?",[host])
Now I have a third table with these two keys as foreign keys, and I insert into it like this:
current.execute("INSERT INTO tblExtensionHistory VALUES(?,?)",[Hid,Eid])
The problem is that the last insertion is not working; it keeps giving errors. Here is what I have tried:
First I thought it was because the mapping table has an autoincrement primary id which I didn't provide, but isn't it supposed to fill that in itself since it's auto-incremented? I went ahead and tried adding Null, None and 0 anyway, but nothing works.
Secondly I thought maybe it was because I'm not actually getting the values from the tables above, so I tried printing them out; they do show up, so that part works.
Any suggestions on what I am doing wrong here?
EDIT:
When I don't provide the primary key, I get this error:
The table has three columns but you provided only two values
and when I do provide it as None, Null or 0 it says:
Parameter 0 is not supported probably because of unsupported type
I tried implementing the @abarnet way, but it still keeps saying parameter 0 is not supported:
import sqlite3

connection = sqlite3.connect('WebInfrastructureScan.db')
with connection:
    current = connection.cursor()
    current.execute("SELECT ExtensionID FROM tblExtensionLookup where ExtensionName = ?",[ext])
    Eid = current.fetchone()
    print Eid
    current.execute("SELECT HostID FROM tblHostLookup where HostName = ?",[host])
    Hid = current.fetchone()
    print Hid
    current.execute("INSERT INTO tblExtensionHistory(HostID,ExtensionID) VALUES(?,?)",[Hid,Eid])
EDIT 2 :
The database schema is :
table 1:
CREATE TABLE tblHostLookup (
HostID INTEGER PRIMARY KEY AUTOINCREMENT,
HostName TEXT);
table2:
CREATE TABLE tblExtensionLookup (
ExtensionID INTEGER PRIMARY KEY AUTOINCREMENT,
ExtensionName TEXT);
table3:
CREATE TABLE tblExtensionHistory (
ExtensionHistoryID INTEGER PRIMARY KEY AUTOINCREMENT,
HostID INTEGER,
ExtensionID INTEGER,
FOREIGN KEY(HostID) REFERENCES tblHostLookup(HostID),
FOREIGN KEY(ExtensionID) REFERENCES tblExtensionLookup(ExtensionID));
It's hard to be sure without full details, but I think I can guess the problem.
If you use the INSERT statement without column names, the values must exactly match the columns as given in the schema. You can't skip over any of them.*
The right way to fix this is to just use the column names in your INSERT statement. Something like:
current.execute("INSERT INTO tblExtensionHistory (HostID, ExtensionID) VALUES (?,?)",
[Hid, Eid])
Now you can skip any columns you want (as long as they're autoincrement, nullable, or otherwise skippable, of course), or provide them in any order you want.
For your second problem, you're trying to pass in rows as if they were single values. You can't do that. From your code:
Eid = current.fetchone()
This will return something like:
(3,)
And then you try to bind that to the ExtensionID column, which gives you an error.
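A minimal sketch of that part of the fix, reusing the names from the question (ext and host are the lookup values): take the scalar out of the one-column row before binding it.

current.execute("SELECT ExtensionID FROM tblExtensionLookup WHERE ExtensionName = ?", [ext])
row = current.fetchone()
Eid = row[0] if row is not None else None  # unpack the 1-tuple into a plain value

current.execute("SELECT HostID FROM tblHostLookup WHERE HostName = ?", [host])
row = current.fetchone()
Hid = row[0] if row is not None else None

current.execute(
    "INSERT INTO tblExtensionHistory (HostID, ExtensionID) VALUES (?, ?)",
    [Hid, Eid],
)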
In the future, you may want to write and debug the SQL statements in the sqlite3 command-line tool and/or your favorite GUI database manager (there's a simple extension for Firefox if you don't want anything fancy) and get them right before you try getting the Python right.
* This is not true with all databases. For example, in MSJET/Access, you must skip over autoincrement columns. See the SQLite documentation for how SQLite interprets INSERT with no column names, or similar documentation for other databases.

Set SQLAlchemy to use PostgreSQL SERIAL for identity generation

Background:
The application I am currently developing is in transition from SQLite3 to PostgreSQL. All the data has been successfully migrated by using .dump from the current database and changing all tables of the form
CREATE TABLE foo (
id INTEGER NOT NULL,
bar INTEGER,
...
PRIMARY KEY (id),
FOREIGN KEY(bar) REFERENCES foobar (id),
...
);
to
CREATE TABLE foo (
id SERIAL NOT NULL,
bar INTEGER,
...
PRIMARY KEY (id),
FOREIGN KEY(bar) REFERENCES foobar (id) DEFERRABLE,
...
);
and SET CONSTRAINTS ALL DEFERRED;.
Since I am using SQLAlchemy I was expecting things to work smoothly from then on, after of course changing the engine. But the problem seems to be with the autoincrement of the primary key to a unique value on INSERT.
The table, say foo, I am currently having trouble with has 7500+ rows, but the sequence foo_id_seq's current value is set to 5 (because I have tried the insert five times now, all of which have failed).
Question:
So now my question is: without explicitly supplying the id in the INSERT statement, how can I make Postgres automatically assign a unique value to the id field of foo? Or, more specifically, have the sequence return a unique value for it?
Sugar:
Achieve all that through the SQLAlchemy interface.
Environment details:
Python 2.6
SQLAlchemy 8.2
PostgreSQL 9.2
psycopg2 - 2.5.1 (dt dec pq3 ext)
PS: If anybody finds a more appropriate title for this question please edit it.
Your PRIMARY KEY should be defined to use a SEQUENCE as a DEFAULT, either via the SERIAL convenience pseudo-type:
CREATE TABLE blah (
id serial primary key,
...
);
or an explicit SEQUENCE:
CREATE SEQUENCE blah_id_seq;
CREATE TABLE blah (
id integer primary key default nextval('blah_id_seq'),
...
);
ALTER SEQUENCE blah_id_seq OWNED BY blah.id;
This is discussed in the SQLAlchemy documentation.
You can add this to an existing table:
CREATE SEQUENCE blah_id_seq OWNED BY blah.id;
ALTER TABLE blah ALTER COLUMN id SET DEFAULT nextval('blah_id_seq');
This is the approach to take if you prefer to restore a dump and then add the sequences manually.
If there's existing data you've loaded directly into the tables with COPY or similar, you need to set the sequence starting point:
SELECT setval('blah_id_seq', max(id)+1) FROM blah;
I'd say the issue is likely to do with developing against SQLite, then dumping and restoring that dump into PostgreSQL. SQLAlchemy expects to create the schema itself, with the appropriate defaults and sequences.
What I recommend you do instead is to get SQLAlchemy to create a new, empty database. Dump the data for each table from the SQLite DB to CSV, then COPY that data into the PostgreSQL tables. Finally, update the sequences with setval so they generate the appropriate values.
One way or the other, you will need to make sure that the appropriate sequences are created. You can do it by SERIAL pseudo-column types, or by manual SEQUENCE creation and DEFAULT setting, but you must do it. Otherwise there's no way to assign a generated ID to the table in an efficient, concurrency-safe way.
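For the "sugar" part, a minimal sketch of how this looks through SQLAlchemy; the model and connection URL are placeholders patterned on the foo/foobar tables from the question. An Integer primary key is emitted as SERIAL when SQLAlchemy creates the table on PostgreSQL, and the one-off setval can be issued through the engine as well:

from sqlalchemy import create_engine, Column, Integer, ForeignKey
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class FooBar(Base):
    __tablename__ = 'foobar'
    id = Column(Integer, primary_key=True)

class Foo(Base):
    __tablename__ = 'foo'
    id = Column(Integer, primary_key=True)   # becomes "id SERIAL NOT NULL" on PostgreSQL
    bar = Column(Integer, ForeignKey('foobar.id'))

engine = create_engine('postgresql+psycopg2://user:password@localhost/mydb')  # placeholder URL
Base.metadata.create_all(engine)

# Only needed if rows were loaded with explicit ids (e.g. restored from the SQLite dump):
engine.execute("SELECT setval('foo_id_seq', (SELECT max(id) FROM foo))")

(engine.execute with a plain SQL string works on SQLAlchemy versions of that era; on current releases you would open a connection and use text() instead.)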
Use
alter sequence foo_id_seq restart with 7600
should give you 7600 the next time you call the sequence.
http://www.postgresql.org/docs/current/static/sql-altersequence.html
And subsequent values after that. Just make sure that you restart it with a value greater than the last id.

Issue with SQLite DROP TABLE statement

EDIT: At this point, I found the errant typo that was responsible, and my question has become "How did the typo that I made cause the error that I received" and "How might I have better debugged this in the future?"
I've setup a database script for SQLite (through pysqlite) as follows:
DROP TABLE IF EXISTS LandTerritory;
CREATE TABLE LandTerritory (
name varchar(50) PRIMARY KEY NOT NULL UNIQUE,
hasSC boolean NOT NULL DEFAULT 0
);
I'm expecting this to always run without error. However, if I run this script twice (using the sqlite3.Connection.executescript method), I get this error:
OperationalError: table LandTerritory already exists
Trying to debug this myself, I run DROP TABLE LandTerritory on its own and get:
sqlite3.OperationalError: no such table: main.LandTerrito
I'm guessing this has something to do with the "main." part, but I'm not sure what.
EDIT:
Okay, PRAGMA foreign_keys=ON is definitely involved here too. When I created my connection, I turned on foreign_keys. If I don't turn that on, I don't seem to get this error.
And I should have mentioned that there's more to the script, but I had assumed the error was occurring in these first two statements. The rest of the script just does the same thing: drop table, define table. A few of the tables have foreign key references to LandTerritory.
Is there a way to get something like line number information about the sqlite errors? That would be really helpful.
EDIT 2:
Okay, here's another table in the script that references the first.
DROP TABLE IF EXISTS LandAdjacent;
CREATE TABLE LandAdjacent (
tname1 varchar(50) NOT NULL,
tname2 varchar(50) NOT NULL,
PRIMARY KEY (tname1, tname2),
/* Foreign keys */
FOREIGN KEY (tname1)
REFERENCES LandTerrito
ON DELETE CASCADE
ON UPDATE CASCADE,
FOREIGN KEY (tname2)
REFERENCES LandTerritory(name)
ON DELETE CASCADE
ON UPDATE CASCADE
);
Looking at this, I found where the "LandTerrito" came from; somehow a few characters got cut off. I'm guessing fixing this may fix my problem.
But I'm really confused how a broken line in this table led to the script running correctly the first time, and then giving me an error related to a different table when I run it the second time, and how foreign keys played into this.
I guess, to reiterate from above, is there a better way to debug this sort of thing?
The source of the error is your typo
REFERENCES LandTerrito
in line 8 of your script. This leads to the "missing" table LandTerrito in the CREATE TABLE LandAdjacent statement.
If you run your two CREATE TABLE statements, SQLite won't complain. But if you have PRAGMA foreign_keys=ON; and try to run an INSERT or DELETE statement on the table LandAdjacent, you'll get the error no such table: main.LandTerrito.
Because of the foreign key constraints, however, DROP TABLE on LandTerritory results in a DELETE on the table LandAdjacent, which triggers the error.
The following things will avoid the error:
set PRAGMA foreign_keys=OFF; before you drop the table (tested), or
add a dummy table LandTerrito (tested) or
drop LandAdjacent first, then LandTerritory (tested) or
don't use ON DELETE CASCADE (not tested),
and of course correcting the original typo.
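On the "how might I have better debugged this" question: executescript reports no position information, so one option is to split the script yourself and run the statements one at a time, so the exception points at the statement that actually failed. A rough sketch (the file names are placeholders, and splitting on ';' is naive, assuming no semicolons inside string literals):

import sqlite3

connection = sqlite3.connect('game.db')      # placeholder database path
connection.execute("PRAGMA foreign_keys=ON")

with open('schema.sql') as f:                # placeholder script name
    script = f.read()

for number, statement in enumerate(script.split(';'), start=1):
    statement = statement.strip()
    if not statement:
        continue
    try:
        connection.execute(statement)
    except sqlite3.OperationalError as error:
        # Report which statement failed instead of a bare "no such table" message.
        print("statement %d failed: %s\n%s" % (number, error, statement))
        raise
connection.commit()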
Put a "GO" (or whatever equivalent is used in SQLlite) to terminate a batch between the drop table statement and the create statement
