Splitting a CSV table into SQL Tables with Foreign Keys - python

Say I have the following CSV file
Purchases.csv
+--------+----------+
| Client | Item     |
+--------+----------+
| Mark   | Computer |
| Mark   | Lamp     |
| John   | Computer |
+--------+----------+
What is the best practice, in Python, to split this table into two separate tables and join them in a bridge table using foreign keys, i.e.
Client table
+----------+--------+
| ClientID | Client |
+----------+--------+
| 1        | Mark   |
| 2        | John   |
+----------+--------+
Item table
+--------+----------+
| ItemID | Item     |
+--------+----------+
| 1      | Computer |
| 2      | Lamp     |
+--------+----------+
Item Client Bridge Table
+----------+--------+
| ClientID | ItemID |
+----------+--------+
| 1        | 1      |
| 1        | 2      |
| 2        | 1      |
+----------+--------+
I should mention here that it is possible for records to already exist in the tables, i.e., if the client name in the CSV already has an assigned ID in the Client table, that ID should be used in the bridge table. This is because I have to do a one-time batch upload of a million lines of data, and then insert a few thousand lines of data daily.
I have also already created the tables; they are in the database, just empty at the moment.

You would do this in the database (or via database commands in Python). The data never needs to be loaded into Python.
Load the purchases.csv table into a staging table in the database. Then be sure you have your tables defined:
create table clients (
    clientId int generated always as identity primary key,
    client varchar(255)
);
create table items (
    itemId int generated always as identity primary key,
    item varchar(255)
);
create table clientItems (
    clientItemId int generated always as identity primary key,
    clientId int references clients(clientId),
    itemId int references items(itemId)
);
Note that the exact syntax for these depends on the database. Then load the tables:
insert into clients (client)
    select distinct s.client
    from staging s
    where not exists (select 1 from clients c where c.client = s.client);
insert into items (item)
    select distinct s.item
    from staging s
    where not exists (select 1 from items i where i.item = s.item);
I'm not sure if you need to take duplicates into account for ClientItems:
insert into ClientItems (clientId, itemId)
    select c.clientId, i.itemId
    from staging s join
         clients c
         on s.client = c.client join
         items i
         on s.item = i.item;
If you need to prevent duplicates here, then:
where not exists (select 1
                  from clientitems ci join
                       clients c
                       on c.clientid = ci.clientid join
                       items i
                       on i.itemid = ci.itemid
                  where c.client = s.client and i.item = s.item
                 );
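If you do want to drive those steps from Python, here is a minimal sketch. It assumes SQLite purely so the example is self-contained; the shop.db file name, the staging(client, item) table, and the ? placeholder style are assumptions, so swap in your real driver and connection (and your engine's own identity syntax for the DDL above):
import csv
import sqlite3

conn = sqlite3.connect("shop.db")
cur = conn.cursor()

# 1. Load Purchases.csv into the staging table.
with open("Purchases.csv", newline="") as f:
    rows = [(r["Client"], r["Item"]) for r in csv.DictReader(f)]
cur.execute("DELETE FROM staging")  # start each batch from an empty staging table
cur.executemany("INSERT INTO staging (client, item) VALUES (?, ?)", rows)

# 2. Add only the clients and items that are not already in the lookup tables.
cur.execute("""
    INSERT INTO clients (client)
    SELECT DISTINCT s.client FROM staging s
    WHERE NOT EXISTS (SELECT 1 FROM clients c WHERE c.client = s.client)""")
cur.execute("""
    INSERT INTO items (item)
    SELECT DISTINCT s.item FROM staging s
    WHERE NOT EXISTS (SELECT 1 FROM items i WHERE i.item = s.item)""")

# 3. Resolve the ids and fill the bridge table.
cur.execute("""
    INSERT INTO clientItems (clientId, itemId)
    SELECT c.clientId, i.itemId
    FROM staging s
    JOIN clients c ON c.client = s.client
    JOIN items i ON i.item = s.item""")
conn.commit()
Because the existence checks run against the full tables each time, the same script works for the initial batch load and for the daily incremental loads.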

Related

Postgres deadlocks during concurrent upserts from temporary tables

I have process that is controlled by Airflow that generates a number of tasks performing concurrent inserts to a Postgres database.
Each task takes a pandas dataframe, inserts the rows to a temporary table, then upserts from the temporary table to the target table. This is leading to deadlocks, but I am having a tough time understanding how to mitigate this issue. I have pulled out the salient components here, though please let me know if I have failed to include enough information.
I am on Python 3.8.2, Postgres 11.7, Airflow 1.10.10, and I am using psycopg2 as the database adapter.
# create temp table like target table
temp_table_name = 'mur_global_raw_tmp_61400102'
temp_table_sql = f'CREATE TEMP TABLE {temp_table_name} (LIKE mur_global_raw INCLUDING IDENTITY);'
cur.execute(temp_table_sql)
# serialize dataframe and copy to temp table
pd_df_serial = StringIO()
pd_df.to_csv(pd_df_serial, sep='\t', header=False, index=False)
pd_df_serial.seek(0)
cur.copy_from(pd_df_serial, temp_table_name, null="", columns=pd_df.columns.to_list())
conn.commit()
# upsert from temp table to target table
pd_df_insert_sql = '''INSERT INTO mur_global_raw(lat,lon,time,analysed_sst)
    (SELECT lat,lon,time,analysed_sst FROM mur_global_raw_tmp_61400102 AS tmp_vals
     ORDER BY lat,lon,time,analysed_sst)
    ON CONFLICT DO NOTHING;'''
cur.execute(pd_df_insert_sql)
conn.commit()
Here is the schema of the temporary table.
    Column    |           Type           | Collation | Nullable |             Default              | Storage | Stats target | Description
--------------+--------------------------+-----------+----------+----------------------------------+---------+--------------+-------------
 ind          | bigint                   |           | not null | generated by default as identity | plain   |              |
 lat          | double precision         |           |          |                                  | plain   |              |
 lon          | double precision         |           |          |                                  | plain   |              |
 time         | timestamp with time zone |           |          |                                  | plain   |              |
 analysed_sst | double precision         |           |          |                                  | plain   |              |
And here is the schema of the target table.
    Column    |           Type           | Collation | Nullable |             Default              | Storage | Stats target | Description
--------------+--------------------------+-----------+----------+----------------------------------+---------+--------------+-------------
 ind          | bigint                   |           | not null | generated by default as identity | plain   |              |
 lat          | double precision         |           |          |                                  | plain   |              |
 lon          | double precision         |           |          |                                  | plain   |              |
 time         | timestamp with time zone |           |          |                                  | plain   |              |
 analysed_sst | double precision         |           |          |                                  | plain   |              |
Indexes:
    "mur_global_raw_pkey" PRIMARY KEY, btree (ind)
And finally, here is a sample from the server log:
2020-06-22 23:03:36 UTC::#:[3570]:LOG: checkpoint starting: xlog
2020-06-22 23:03:42 UTC:xxxxx(38068):postgres#public_data_raw:[13975]:WARNING: there is no transaction in progress
2020-06-22 23:03:43 UTC:xxxxx(38090):postgres#public_data_raw:[13993]:ERROR: deadlock detected
2020-06-22 23:03:43 UTC:xxxxx(38090):postgres#public_data_raw:[13993]:DETAIL: Process 13993 waits for ShareLock on transaction 42977; blocked by process 14014.
Process 14014 waits for ShareLock on transaction 42981; blocked by process 14021.
Process 14021 waits for ShareLock on transaction 42980; blocked by process 13993.
Process 13993: INSERT INTO mur_global_raw(lat,lon,time,analysed_sst) (SELECT lat,lon,time,analysed_sst FROM mur_global_raw_tmp_75410038 as tmp_vals ORDER BY lat,lon,time,analysed_sst) ON CONFLICT DO NOTHING;
Process 14014: INSERT INTO mur_global_raw(lat,lon,time,analysed_sst) (SELECT lat,lon,time,analysed_sst FROM mur_global_raw_tmp_41473761 as tmp_vals ORDER BY lat,lon,time,analysed_sst) ON CONFLICT DO NOTHING;
Process 14021: INSERT INTO mur_global_raw(lat,lon,time,analysed_sst) (SELECT lat,lon,time,analysed_sst FROM mur_global_raw_tmp_28913605 as tmp_vals ORDER BY lat,lon,time,analysed_sst) ON CONFLICT DO NOTHING;
2020-06-22 23:03:43 UTC:xxxxx(38090):postgres#public_data_raw:[13993]:HINT: See server log for query details.
2020-06-22 23:03:43 UTC:xxxxx(38090):postgres#public_data_raw:[13993]:CONTEXT: while inserting index tuple (1969403,34) in relation "mur_global_raw"
2020-06-22 23:03:43 UTC:xxxxx(38090):postgres#public_data_raw:[13993]:STATEMENT: INSERT INTO mur_global_raw(lat,lon,time,analysed_sst) (SELECT lat,lon,time,analysed_sst FROM mur_global_raw_tmp_75410038 as tmp_vals ORDER BY lat,lon,time,analysed_sst) ON CONFLICT DO NOTHING;
These deadlocks are happening persistently and regularly, so hopefully there is a component of the design that I can address to avoid them. My understanding of the locks going on is clearly not good enough to address the problem at this stage.
If anyone can help me understand the locks and transactions that are leading to this three-way deadlock, I would most appreciate it. Of course, if you have an idea for how to avoid it, I welcome that as well.
My humble thanks to the SO community.
The best workaround I've got is to add an exclusive lock before starting the upsert, like so:
LOCK TABLE mur_global_raw IN EXCLUSIVE MODE;
Any comments welcome.
If you cannot figure out a better way, catch the deadlock errors and repeat the transaction. If the deadlocks happen a lot, that is annoying and will harm performance, but it is better than a table lock, because it won't prevent autovacuum from doing its important work.
Perhaps you can reduce the size or duration of the batches to make a deadlock less likely.
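As a rough sketch of the catch-and-retry approach, assuming psycopg2 2.8+ (which exposes psycopg2.errors.DeadlockDetected); the helper name, retry count, and backoff policy are just illustrative:
import time
from psycopg2 import errors

def upsert_with_retry(conn, insert_sql, max_retries=5):
    # Run the temp-table -> target upsert, retrying when Postgres aborts us as a deadlock victim.
    for attempt in range(max_retries):
        try:
            with conn.cursor() as cur:
                cur.execute(insert_sql)
            conn.commit()
            return
        except errors.DeadlockDetected:
            conn.rollback()           # the aborted transaction must be rolled back before retrying
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"upsert still deadlocking after {max_retries} attempts")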

SQL multiple table conditional retrieving statement

I'm trying to retrieve data from an sqlite table conditional on another sqlite table within the same database. The formats are the following:
master table
--------------------------------------------------
| id         | TypeA  | TypeB  | ...
--------------------------------------------------
| 2020/01/01 | ID_0-0 | ID_0-1 | ...
--------------------------------------------------
child table
--------------------------------------------------
| id     | Attr1  | Attr2 | ...
--------------------------------------------------
| ID_0-0 | 112.04 | -3.45 | ...
--------------------------------------------------
I want to write an sqlite3 query that takes:
A date D present in master.id
A list of types present in master's columns
A list of attributes present in child's columns
and returns a dataframe with rows the types and columns the attributes. Of course, I could just read the tables into pandas and do the work there, but I think it would be more computationally intensive and I want to learn more SQL syntax!
UPDATE
So far I've tried:
"""SELECT *
FROM child
WHERE EXISTS
(SELECT *
FROM master
WHERE master.TypeX = Type_input
WHERE master.dateX = date_input)"""
and then concatenate the strings over all required TypeX and dateX and execute with:
cur.executescript(script)
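If you stick with that per-type lookup, here is a rough sketch using bound parameters instead of building one big script by string concatenation; the database file name, the example type/attribute lists, and the use of pandas.read_sql_query are assumptions, while the master.id / child.id column names follow the tables above:
import sqlite3
import pandas as pd

conn = sqlite3.connect("data.db")
date_input = "2020/01/01"
types = ["TypeA", "TypeB"]   # column names of master holding child ids
attrs = ["Attr1", "Attr2"]   # column names of child to pull out

frames = []
for t in types:
    # Column names cannot be bound as parameters, so they are interpolated from
    # the trusted lists above; the date value is bound with a ? placeholder.
    sql = (f"SELECT {', '.join(attrs)} FROM child "
           f"WHERE id = (SELECT {t} FROM master WHERE id = ?)")
    frames.append(pd.read_sql_query(sql, conn, params=(date_input,)).assign(type=t))

result = pd.concat(frames).set_index("type")   # rows = types, columns = attributes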

Get the intersection of two many-to-many relationship of specific values

N.B. I have tagged this with SQLAlchemy and Python because the whole point of the question was to develop a query to translate into SQLAlchemy. This is clear in the answer I have posted. It is also applicable to MySQL.
I have three interlinked tables I use to describe a book. (In the table descriptions below I have eliminated rows extraneous to the question at hand.)
MariaDB [icc]> describe edition;
+-----------+------------+------+-----+---------+----------------+
| Field     | Type       | Null | Key | Default | Extra          |
+-----------+------------+------+-----+---------+----------------+
| id        | int(11)    | NO   | PRI | NULL    | auto_increment |
+-----------+------------+------+-----+---------+----------------+
7 rows in set (0.001 sec)
MariaDB [icc]> describe line;
+------------+--------------+------+-----+---------+----------------+
| Field      | Type         | Null | Key | Default | Extra          |
+------------+--------------+------+-----+---------+----------------+
| id         | int(11)      | NO   | PRI | NULL    | auto_increment |
| edition_id | int(11)      | YES  | MUL | NULL    |                |
| line       | varchar(200) | YES  |     | NULL    |                |
+------------+--------------+------+-----+---------+----------------+
5 rows in set (0.001 sec)
MariaDB [icc]> describe line_attribute;
+------------+------------+------+-----+---------+-------+
| Field      | Type       | Null | Key | Default | Extra |
+------------+------------+------+-----+---------+-------+
| line_id    | int(11)    | NO   | PRI | NULL    |       |
| num        | int(11)    | YES  |     | NULL    |       |
| precedence | int(11)    | YES  | MUL | NULL    |       |
| primary    | tinyint(1) | NO   | MUL | NULL    |       |
+------------+------------+------+-----+---------+-------+
5 rows in set (0.001 sec)
line_attribute.precedence is the hierarchical level of the given heading. So if War and Peace has Books > Chapters, all of the lines have an attribute that corresponds to the Book they're in (e.g., Book 1 has precedence=1 and num=1) and an attribute for the Chapter they're in (e.g., Chapter 2 has precedence=2 and num=2). This allows me to translate the hierarchical structure of books with volumes, books, parts, chapters, sections, or even acts and scenes. The primary column is a boolean, so that each and every line has one attribute that is primary. If it is a book heading, it is the Book attribute, if it is a chapter heading, it is the Chapter attribute. If it is a regular line in text, it is a line attribute, and the precedence is 0 since it is not a part of the hierarchical structure.
I need to be able to query for all lines with a particular edition_id and that also have the intersection of two line_attributes.
(This would allow me to get all lines from a particular edition that are in, say, Book 1 Chapter 2 of War and Peace).
I can get all lines that have Book 1 with
SELECT
line.*
FROM
line
INNER JOIN
line_attribute
ON
line_attribute.line_id=line.id
WHERE
line.edition_id=2 AND line_attribute.precedence=1 AND line_attribute.num=1;
and I can get all lines that have Chapter 2:
SELECT
line.*
FROM
line
INNER JOIN
line_attribute
ON
line_attribute.line_id=line.id
WHERE
line.edition_id=2 AND line_attribute.precedence=2 AND line_attribute.num=2;
Except the second query returns each chapter 2 from every book in War and Peace.
How do I get from these two queries to just the lines from book 1 chapter 2?
Warning from Raymond Nijland in the comments:
Note for future readers: because this question is tagged MySQL, be aware that MySQL does not support the INTERSECT keyword. MariaDB is a fork of the MySQL source code, but it supports extra features that MySQL does not. In MySQL you can simulate the INTERSECT keyword with an INNER JOIN or IN().
Trying to write up a question on SO helps me get my thoughts clear and eventually solve the problem before I have to ask the question. The queries above are much clearer than my initial queries and the question pretty much answers itself, but I never found a clear answer that talks about the intersect utility, so I'm posting this answer anyway.
The solution was the INTERSECT operator.
The solution is simply the intersection of those two queries:
SELECT
line.*
FROM
line
INNER JOIN
line_attribute
ON
line_attribute.line_id=line.id
WHERE
line.edition_id=2 AND line_attribute.precedence=1 AND line_attribute.num=1
INTERSECT /* it is literally this simple */
SELECT
line.*
FROM
line
INNER JOIN
line_attribute
ON
line_attribute.line_id=line.id
WHERE
line.edition_id=2 AND line_attribute.precedence=2 AND line_attribute.num=2;
This also means I could get all of the book and chapter headings for a particular book by simply adding an additional constraint (line_attribute.primary=1).
This solution seems broadly applicable to me. Assuming, for instance, you have questions in a StackOverflow clone, which are tagged, you can get the intersection of questions with two tags (e.g., all posts that have both the SQLAlchemy and Python tags). I am certainly going to use this method for that sort of query.
I coded this up in MySQL because it helps me get the query straight to translate it into SQLAlchemy.
The SQLAlchemy query is this simple:
[nav] In [10]: q1 = Line.query.join(LineAttribute).filter(LineAttribute.precedence==1, LineAttribute.num==1)
[ins] In [11]: q2 = Line.query.join(LineAttribute).filter(LineAttribute.precedence==2, LineAttribute.num==2)
[ins] In [12]: q1.intersect(q2).all()
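For a MySQL backend, where INTERSECT is unavailable (see the warning above), here is a rough, untested sketch of the same lookup in SQLAlchemy using IN() on a subquery of matching line ids; it assumes a Flask-SQLAlchemy style setup where db.session exists alongside the Line.query used above:
# line ids that carry the Chapter 2 attribute
chapter_ids = (
    db.session.query(LineAttribute.line_id)
    .filter(LineAttribute.precedence == 2, LineAttribute.num == 2)
)

# Book 1 lines restricted to those chapter ids (simulates the INTERSECT)
lines = (
    Line.query.join(LineAttribute)
    .filter(
        Line.edition_id == 2,
        LineAttribute.precedence == 1,
        LineAttribute.num == 1,
        Line.id.in_(chapter_ids),
    )
    .all()
)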
Hopefully the database structure in this question helps someone down the road. I didn't want to delete the question after I solved my own problem.

Reset increment value in flask-sqlalchemy [duplicate]

This question already has answers here:
How can I reset a autoincrement sequence number in sqlite
(5 answers)
SQLite Reset Primary Key Field
(5 answers)
Closed 4 years ago.
How do I reset the increment count in flask-sqlalchemy after deleting a row, so that the next insert will get the id of the deleted row?
i.e.:
table users:
user_id | name    |
___________________
3       | mbuvi   |
4       | meshack |
5       | You     |
When I delete the user with id=5, the next insertion into users gets id=6, but I want it to have id=5:
user_id | name    |
___________________
3       | mbuvi   |
4       | meshack |
6       | me      |
How do I solve this?
Your database keeps track of the auto-increment id for you, so you can't do this directly. (By the way, this isn't really a flask-sqlalchemy question.) If you really want this behaviour, you have to work out which ids have been freed up and reuse one of them when inserting. For example:
+----+--------+
| id | number |
+----+--------+
| 1  | 819    |
| 2  | 829    |
| 4  | 829    |
| 5  | 829    |
+----+--------+
Here you would have to find the gap (id 3) and insert with that id, which otherwise means scanning the whole table to find the free position. I'm not sure why you need this, but there is a workaround:
Step 1: use a cache for the freed ids; I recommend Redis (a list or sorted set works well).
Step 2: whenever you delete a row, push its id onto the Redis list before deleting it.
Step 3: before inserting a new row, check whether there is a freed id in Redis; if so, pop it and insert the row with that id.
Step 4: the code would look something like this:
import redis

def insert_one(data):
    r = redis.Redis()
    _id = r.lpop('ID_DB')  # reuse a freed id if one is cached
    if _id:
        cursor.execute("insert into xxx(id, data) values (%s, %s)", (int(_id), data))
    else:
        # no freed id available, fall back to the default auto-increment
        cursor.execute("insert into xxx(data) values (%s)", (data,))

def delete(data, id):
    # before deleting the row (ideally by `id`), remember its id for reuse
    r = redis.Redis()
    r.rpush('ID_DB', id)
    # ... then do the actual delete

Django - Get sum of field A where field B is equal to either field B or field C, OVER MULTIPLE ROWS?

+-------------------+-------------------+----------+
| mac_src           | mac_dst           | bytes_in |
+-------------------+-------------------+----------+
| aa:aa:aa:aa:aa:aa | bb:bb:bb:bb:bb:bb | 10       |
| bb:bb:bb:bb:bb:bb | aa:aa:aa:aa:aa:aa | 20       |
| cc:cc:cc:cc:cc:cc | aa:aa:aa:aa:aa:aa | 30       |
+-------------------+-------------------+----------+
I have a table with fields mac_src, mac_dst and bytes_in.
For every MAC address that appears in the table, I need the rows where that address is present in either mac_src or mac_dst, and then the sum of bytes_in over those rows.
In other words, I want one bytes_in total per MAC address, covering both the rows where it is the source and the rows where it is the destination, sorted from highest to lowest.
The queryset returned should have just one entry per MAC address.
Thanks.
I don't think there's a simple way to do it with just the Django ORM. Just write an SQL query (warning: untested and probably slow SQL below):
from django.db import connection
with connection.cursor() as cursor:
    cursor.execute('''
        SELECT mac, SUM(total) FROM (
            (SELECT mac_src AS mac, SUM(bytes_in) AS total FROM your_table GROUP BY mac_src)
            UNION ALL
            (SELECT mac_dst AS mac, SUM(bytes_in) AS total FROM your_table WHERE mac_src != mac_dst GROUP BY mac_dst)
        ) AS combined_rows GROUP BY mac
    ''')
    counts = dict(cursor.fetchall())  # {mac1: total_bytes1, ...}
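To get the highest-to-lowest ordering the question asks for, you could either append ORDER BY SUM(total) DESC to the query or simply sort the resulting dict in Python, e.g.:
# sort the per-MAC totals from highest to lowest
top_talkers = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)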
