Update one postgres table from another postgres table - python

I am loading a batch CSV file into Postgres using Python (say, into Table A).
I am using pandas to upload the data in chunks, which is quite fast.
for chunk in pd.read_csv(csv_file, sep='|',chunksize=chunk_size,low_memory=False):
Now I want to update another table (say Table B) from A based on the following rules:
if there are any new records in Table A which are not in Table B, then insert them as new records in Table B (matched on the Id field)
if the values change in Table A for an ID which already exists in Table B, then update the records in Table B from Table A
(There are several tables which I need to update based on Table A.)
I am able to do that using the code below and then looping through each row, but Table A always has around 1,825,172 records and it becomes extremely slow. Can any forum member help to speed this up or suggest an alternate approach to achieve the same?
cursor.execute(sql)
records = cursor.fetchall()
for row in records:
    id = 0 if row[0] is None else row[0]  # use this to match with Table B and decide insert or update
    id2 = 0 if row[1] is None else row[1]
    id3 = 0 if row[2] is None else row[2]

You could leverage Postgres upsert syntax, like:
insert into tableB (id, col1, col2)
select ta.id, ta.col1, ta.col2 from tableA ta
on conflict (id) do update
set col1 = excluded.col1, col2 = excluded.col2;

Note that inside the do update clause the incoming row has to be referenced as excluded, not through the select's alias.
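Run from Python, this replaces the whole per-row loop with a single statement; a minimal psycopg2 sketch, assuming placeholder table, column, and connection names:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # hypothetical DSN

upsert_sql = """
    insert into tableB (id, col1, col2)
    select ta.id, ta.col1, ta.col2 from tableA ta
    on conflict (id) do update
    set col1 = excluded.col1, col2 = excluded.col2;
"""

with conn, conn.cursor() as cur:
    # one set-based statement instead of ~1.8 million round trips
    cur.execute(upsert_sql)
# the with-block commits on success and rolls back on error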

You should do this completely inside the DBMS, not by looping through the records inside your Python script. That allows the DBMS to optimize the whole operation.
UPDATE TableB
SET x = TableA.y
FROM TableA
WHERE TableA.id = TableB.id;

INSERT INTO TableB (id, x)
SELECT id, y
FROM TableA
WHERE TableA.id NOT IN (SELECT id FROM TableB);
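Both statements can be sent from Python in one transaction, so the update and the insert succeed or fail together; a sketch under the same placeholder names as above:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # hypothetical DSN

with conn, conn.cursor() as cur:
    # update rows that already exist in TableB ...
    cur.execute("""
        UPDATE TableB
        SET x = TableA.y
        FROM TableA
        WHERE TableA.id = TableB.id;
    """)
    # ... then insert the rows that are not in TableB yet
    cur.execute("""
        INSERT INTO TableB (id, x)
        SELECT id, y
        FROM TableA
        WHERE TableA.id NOT IN (SELECT id FROM TableB);
    """)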


How to update a column based on counts from another table?

I need to update the table actor, column numCharacters, depending on how many times each actor's actorID shows up in the characters table.
I have the following code:
cursor = connection.cursor()
statement = 'UPDATE actor SET numCharacters = (SELECT count(*) FROM characters GROUP BY actorID)'
cursor.execute(statement)
connection.commit()
Does anyone know how I could complete it?
I think your problem comes from the fact that your subquery returns multiple rows, so the update statement doesn't know which row to update. Try changing your query to this:
UPDATE actor a
SET numCharacters = (
    SELECT count(*)
    FROM characters c
    WHERE c.actorID = a.id
);

Correlating the subquery on the actor's id means it returns exactly one count per actor, and dropping the GROUP BY means actors with no characters get a count of 0 instead of NULL.
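Dropped back into the question's Python code, that is (a sketch; connection is whatever DB-API connection you already have):

cursor = connection.cursor()
statement = '''
    UPDATE actor a
    SET numCharacters = (
        SELECT count(*)
        FROM characters c
        WHERE c.actorID = a.id
    )
'''
cursor.execute(statement)
connection.commit()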

Best way to update certain columns of table in SQL Server based on Pandas DataFrame?

I have a table on SQL Server that looks like this, where each row has a unique combination of Event 1 and Event 2.

Global Rules Table
ID | Event 1 | Event 2 | Validated as | Generated as | Generated with score
1  | EA1     | EB1     | Rule         | Anti-Rule    | 0.01
2  | EA1     | EB2     | Rule         | Rule         | 0.95
3  | ...     | ...     | ...          | ...          | ...
I have another table with a Foreign Key constraint to Global Rules Table called Local Rules Table.
I have a Pandas DataFrame that looks like this:

Event 1 | Event 2 | Validated as | Generated as | Generated with score
EA1     | EB1     | Rule         | Rule         | 0.85
EA1     | EB2     | Rule         | Rule         | 0.95
...     | ...     | ...          | ...          | ...
Since I have this Foreign Key constraint between the Local Rules and Global Rules tables, I can't use df.to_sql('Global Rules',con,if_exists='replace').
The columns which I want to update in the database based on values in the DataFrame are Generated as and Generated with score, so what is the best way to update only those columns in the database table based on the DataFrame I have? Is there some out-of-the-box function or library which I don't know about?
I haven't found a library to accomplish this. I started writing one myself to host on PyPI, but haven't finished yet.
An inner join against an SQL temporary table works well in this case. It will only update a subset of columns in SQL and can be efficient for updating many records.
I assume you are using pyodbc for the connection to SQL Server.
SQL Cursor
# quickly stream records into the temp table
cursor.fast_executemany = True
Create Temporary Table
# assuming your DataFrame also has the ID column to perform the SQL join
statement = "CREATE TABLE [#Update_Global Rules Table] (ID BIGINT PRIMARY KEY, [Generated as] VARCHAR(200), [Generated with score] FLOAT)"
cursor.execute(statement)
Insert DataFrame into a Temporary Table
# insert only the key and the updated values
subset = df[['ID','Generated as','Generated with score']]
# form SQL insert statement (bracket the column names, since they contain spaces)
columns = ", ".join("[" + c + "]" for c in subset.columns)
values = '('+', '.join(['?']*len(subset.columns))+')'
# insert
statement = "INSERT INTO [#Update_Global Rules Table] ("+columns+") VALUES "+values
insert = [tuple(x) for x in subset.values]
cursor.executemany(statement, insert)
Update Values in Main Table from Temporary Table
statement = '''
UPDATE
    t
SET
    t.[Generated as] = u.[Generated as],
    t.[Generated with score] = u.[Generated with score]
FROM
    [Global Rules Table] AS t
INNER JOIN
    [#Update_Global Rules Table] AS u
ON
    u.ID = t.ID;
'''
cursor.execute(statement)
Drop Temporary Table
cursor.execute("DROP TABLE [#Update_Global Rules Table]")

sqlite3: my sql query runs into endless loop

I have the following table:
id as int, prop as text, timestamp as int, json as blob
I want to find all pairs which have the same prop and the same timestamp. Later I want to extend the timestamp match to, e.g., +/- 5 sec.
I tried to do it with an INNER JOIN, but my query seems to run forever:
SELECT * FROM myTable c
INNER JOIN myTable c1
ON c.id != c1.id
AND c.prop = c1.prop
AND c.timestamp = c1.timestamp
Maybe my approach is wrong. What is the problem with my query? How can I do it? Actually, I need groups with these pairs.
You could try to see if the query gets faster with a GROUP BY:
SELECT * FROM myTable
WHERE (prop, timestamp) IN (
SELECT prop, timestamp
FROM myTable
GROUP BY prop, timestamp
HAVING COUNT(*) > 1
)
Although it's hard to say without sample data.
If the table is huge, you might have to create an index on (prop, timestamp) to speed up the query.
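A sketch of both steps with Python's built-in sqlite3 module (the database file name is an assumption, and the row-value (prop, timestamp) IN (...) syntax needs SQLite 3.15 or newer):

import sqlite3

conn = sqlite3.connect("mydb.sqlite")  # hypothetical file name

# a composite index lets the duplicate search avoid repeated full scans
conn.execute("CREATE INDEX IF NOT EXISTS idx_prop_ts ON myTable(prop, timestamp)")

rows = conn.execute("""
    SELECT * FROM myTable
    WHERE (prop, timestamp) IN (
        SELECT prop, timestamp
        FROM myTable
        GROUP BY prop, timestamp
        HAVING COUNT(*) > 1
    )
    ORDER BY prop, timestamp  -- keeps the members of each group together
""").fetchall()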

How can I insert the odd rows of a table into another table with python and postgres

I have a python program in which I want to read the odd rows from one table and insert them into another table. How can I achieve this?
For example, the first table has 5 rows in total, and I want to insert the first, third, and fifth rows into another table.
Note that the table may contain millions of rows, so performance is very important.
I found a few methods here. Here are two of them, transcribed to psycopg2.
If you have a sequential primary key, you can just use mod on it:
database_cursor.execute('SELECT * FROM table WHERE mod(primary_key_column, 2) = 1')
Otherwise, you can use a subquery to get the row number and use mod:
database_cursor.execute('''SELECT col1, col2, col3
                           FROM (SELECT row_number() OVER () AS rnum, col1, col2, col3
                                 FROM table) t
                           WHERE mod(rnum, 2) = 1''')
If you have an id-type column that is guaranteed to increment by 1 on every insert (kinda like an auto-increment index), you could always mod that to select the rows. However, this would break when you begin to delete rows from the table you are selecting from.
A more complicated solution is to use PostgreSQL's row_number() function. The following assumes you have an id column that can be used to sort the rows into the desired order:
select r.*
from (select *, row_number() over (order by id) as rnum
      from <tablename>
     ) r
where r.rnum % 2 = 1

Note: regardless of how you do it, the performance will never really be efficient, as you necessarily have to do a full table scan, and selecting all columns of a table with millions of records using a full table scan is going to be slow.
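Since the goal is to insert those rows into another table, either SELECT can be wrapped in an INSERT ... SELECT so the copy happens entirely inside Postgres; a sketch with placeholder table and column names:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # hypothetical DSN

with conn, conn.cursor() as cur:
    # copy the 1st, 3rd, 5th, ... rows (ordered by id) in one statement
    cur.execute("""
        INSERT INTO target_table (col1, col2, col3)
        SELECT col1, col2, col3
        FROM (SELECT row_number() OVER (ORDER BY id) AS rnum,
                     col1, col2, col3
              FROM source_table) t
        WHERE mod(t.rnum, 2) = 1;
    """)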

insert into a table from another table when a new record is inserted

I have 2 tables, and I want to insert all records from the first table into the second one if they do not already exist. If a new row is added to the first table, it must be inserted into the second one as well.
I found this query
INSERT INTO Table2 SELECT * FROM Table1 WHERE NOT EXISTS (SELECT * FROM Table2)
but if a new row is added to table1, the row is not inserted into table2.
PS: table1 and table2 have the same fields and contain thousands of records
You could do it all in SQL - if you can run, for example, a batch every 2 minutes ...
-- with these two tables and their contents ...
DROP TABLE IF EXISTS tableA;
CREATE TABLE IF NOT EXISTS tableA(id,name,dob) AS (
SELECT 42,'Arthur Dent',DATE '1957-04-22'
UNION ALL SELECT 43,'Ford Prefect',DATE '1900-08-01'
UNION ALL SELECT 44,'Tricia McMillan',DATE '1959-03-07'
UNION ALL SELECT 45,'Zaphod Beeblebrox',DATE '1900-02-01'
);
ALTER TABLE tableA ADD CONSTRAINT pk_A PRIMARY KEY(id);
DROP TABLE IF EXISTS tableB;
CREATE TABLE IF NOT EXISTS tableB(id,name,dob) AS (
SELECT 43,'Ford Prefect',DATE '1900-08-01'
UNION ALL SELECT 44,'Tricia McMillan',DATE '1959-03-07'
UNION ALL SELECT 45,'Zaphod Beeblebrox',DATE '1900-02-01'
);
ALTER TABLE tableB ADD CONSTRAINT pk_B PRIMARY KEY(id);
-- .. this MERGE statement will add the row with id=42 to table B (MERGE requires Postgres 15 or newer) ..
MERGE
INTO tableB t
USING tableA s
ON s.id = t.id
WHEN NOT MATCHED THEN INSERT (
id
, name
, dob
) VALUES (
s.id
, s.name
, s.dob
);
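If you drive that batch from Python, it is a single statement to execute on a schedule; a minimal psycopg2 sketch (the DSN is an assumption, and MERGE again needs Postgres 15+):

import psycopg2

MERGE_SQL = """
    MERGE INTO tableB t
    USING tableA s
    ON s.id = t.id
    WHEN NOT MATCHED THEN INSERT (id, name, dob)
    VALUES (s.id, s.name, s.dob);
"""

# run the sync batch; could be scheduled, e.g., every 2 minutes
with psycopg2.connect("dbname=mydb") as conn, conn.cursor() as cur:
    cur.execute(MERGE_SQL)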
So it's an Access database.
Tough luck. They have no triggers.
Well then. I suppose you know how to insert a row into Access using Python, so I won't go through that.
I'll just build a scenario as of right after having inserted a row.
CREATE TABLE tableA (
id INTEGER
, name VARCHAR(20)
, dob DATE
, PRIMARY KEY (id)
)
;
INSERT INTO tableA(id,name,dob) VALUES(42,'Arthur Dent','1957-04-22');
INSERT INTO tableA(id,name,dob) VALUES(43,'Ford Prefect','1900-08-01');
INSERT INTO tableA(id,name,dob) VALUES(44,'Tricia McMillan','1959-03-07');
INSERT INTO tableA(id,name,dob) VALUES(45,'Zaphod Beeblebrox','1900-02-01');
CREATE TABLE tableB (
id INTEGER
, name VARCHAR(20)
, dob DATE
, PRIMARY KEY (id)
)
;
INSERT INTO tableB(id,name,dob) VALUES(43,'Ford Prefect','1900-08-01');
INSERT INTO tableB(id,name,dob) VALUES(44,'Tricia McMillan','1959-03-07');
INSERT INTO tableB(id,name,dob) VALUES(45,'Zaphod Beeblebrox','1900-02-01');
Ok. Scenario ready.
Merge ....
MERGE
INTO tableB t
USING tableA s
ON s.id = t.id
WHEN NOT MATCHED THEN INSERT (
id
, name
, dob
) VALUES (
s.id
, s.name
, s.dob
);
42000:1:-3500:[Microsoft][ODBC Microsoft Access Driver]
Invalid SQL statement;
expected 'DELETE', 'INSERT', 'PROCEDURE', 'SELECT', or 'UPDATE'.
So, Merge not supported.
Trying something else:
INSERT INTO tableB
SELECT * FROM tableA
WHERE id <> ALL (SELECT id FROM tableB)
;
1 row inserted
Or:
-- UNDO the previous insert
DELETE FROM tableB WHERE id=42;
1 row deleted
-- retry ...
INSERT INTO tableB
SELECT * FROM tableA
WHERE id NOT IN (SELECT id FROM tableB)
;
1 row inserted
You could run it as shown above.
Or, if your insert into Table A, from Python, was:
INSERT INTO tableA(id,name,dob) VALUES(?,?,?);
... and you supplied the values for id, name and dob via host variables,
you could continue with:
INSERT INTO tableB
SELECT * FROM tableA a
WHERE id=?
AND NOT EXISTS(
SELECT * FROM tableB WHERE id=a.id
);
You would still have the value 42 in the first host variable and could just reuse it. In the case of single-row inserts, this way is faster.
Should you perform mass inserts, I would insert all new rows into table A first, and then run the INSERT ... WHERE ... NOT IN or the INSERT ... WHERE id <> ALL variant.
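A minimal pyodbc sketch of that single-row pattern (the driver string, file path, and reuse of the id value are assumptions):

import pyodbc

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\data\mydb.accdb"  # hypothetical path
)
cur = conn.cursor()

row = (42, 'Arthur Dent', '1957-04-22')

# insert the new row into table A ...
cur.execute("INSERT INTO tableA(id,name,dob) VALUES(?,?,?)", row)

# ... then propagate it to table B if it is not there yet, reusing the id
cur.execute("""
    INSERT INTO tableB
    SELECT * FROM tableA a
    WHERE id = ?
    AND NOT EXISTS (SELECT * FROM tableB WHERE id = a.id)
""", row[0])

conn.commit()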
