PySpark - Get the inserted row id after writing to PostgreSQL DB - python

I'm using PySpark to write a DataFrame to a PostgreSQL database via the JDBC call below. How can I get the inserted row IDs, which are assigned by an identity column with auto-increment?
I'm using the single write call below, not a for-loop inserting each row separately.
df.write.jdbc(url=url, table="table1", mode=mode, properties=properties)
I know I can use monotonicallyIncreasingId and set the IDs within Spark, but I'm looking for an alternative where the DB handles the assignment, and I want to get the IDs back to use in other DataFrames.
I didn't find this in the documentation.

The easiest way is to query the table you just wrote to and read it back into a new DataFrame.
Alternatively, if you iterate over each row in a for loop or generator, fetch the ID of the record you just created before the end of each iteration and append the IDs as a new column in the DataFrame.
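A minimal sketch of the first option, assuming the url/properties from the question and a natural-key column (order_ref here is purely hypothetical, standing in for whatever uniquely identifies your rows) to join the DB-assigned id back onto the original DataFrame:
# Read the table back after the write; "id" and "order_ref" are assumed column names.
ids_df = spark.read.jdbc(url=url, table="table1", properties=properties)
df_with_ids = df.join(ids_df.select("id", "order_ref"), on="order_ref", how="left")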

Related

SQLAlchemy sqlite3 remove value from JSON column on multiple rows with different JSON values

Say I have an id column that is saved as ids JSON NOT NULL using SQLAlchemy, and now I want to delete an id from this column. I'd like to do several things at once:
query only the rows who have this specific ID
delete this ID from all rows it appears in
a bonus, if possible - delete the row if the ID list is now empty.
For the query, something like this:
db.query(models.X).filter(id in list(models.X.ids)) should work.
Now, I'd rather avoid iterating over the query results and sending an update request for each row, since there can be multiple rows. Is there an elegant way to do this?
Thanks!
For the search-and-remove part you can use the json_remove function (one of SQLite's built-in JSON functions):
from sqlalchemy import func
db.query(models.X).update({'ids': func.json_remove(models.X.ids,f'$[{TARGET_ID}]') })
Here replace TARGET_ID with the targeted id.
Note that this will update the rows 'silently' (whether or not this id is present in the array).
If you want to first check whether the target id is in the column, you can query all rows containing it with a json_extract query (calling the .all() method) and then remove those ids with an .update() call.
But this will cost you double amount of queries (less performant).
For the delete part, you can use the json_array_length built-in function
from sqlalchemy import func
db.query(models.X).filter(func.json_array_length(models.X.ids) == 0).delete()
FYI: I'm not sure you can do both in one query, and even if it's possible, I wouldn't do it, for clean-syntax, logging and monitoring reasons.
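Putting the two snippets above together, a hedged sketch of the full flow (assuming db is a Session and TARGET_ID is the array index to strip, as in the answer's own usage):
from sqlalchemy import func

# 1) strip the entry from every row's JSON array
db.query(models.X).update(
    {'ids': func.json_remove(models.X.ids, f'$[{TARGET_ID}]')},
    synchronize_session=False,
)
# 2) then drop any row whose array is now empty
db.query(models.X).filter(func.json_array_length(models.X.ids) == 0).delete(
    synchronize_session=False
)
db.commit()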

pyodbc - write a new column of data to existing table in ms access

I have an MS Access DB I've connected to with the following (ignore the ... in the driver string, it's working):
driver = 'DRIVER={...'
con = pyodbc.connect(driver)
cursor = con.cursor()
I have a pandas DataFrame which is exactly the same as a table in the DB except for one additional column. Basically I pulled the table with pyodbc, merged it with external Excel data to add this additional column, and now want to push the data back to the MS Access table with the new column. The column holding the new information is merged_df['Item'].
Trying things like the statement below does not work; I've had a variety of errors.
cursor.execute("insert into ToolingData(Item) values (?)", merged_df['Item'])
con.commit()
How can I push the new column to the original table? Can I just write over the entire table instead, and would that be easier, since merged_df is literally the same thing with one new column added?
If the target MS Access table does not already contain a field to house the data held within the additional column, you'll first need to execute an alter table statement to add the new field.
For example, the following will add a 255-character text field called item to the table ToolingData:
alter table ToolingData add column item text(255)
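Once the field exists, one way to push the values across is an UPDATE per row, matched on a key column. A rough sketch, assuming the connection from the question and a hypothetical key column PartNumber present both in the table and in merged_df:
# Add the new field, then copy the values over row by row (PartNumber is an assumed key).
cursor.execute("ALTER TABLE ToolingData ADD COLUMN Item TEXT(255)")
for _, r in merged_df.iterrows():
    cursor.execute("UPDATE ToolingData SET Item = ? WHERE PartNumber = ?",
                   r['Item'], r['PartNumber'])
con.commit()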

Is there any way to insert data at the bottom of the table?

I created a SQL Server table by importing data from a CSV file. The table contains about 6000 rows, all floats. I am trying to insert a new row using INSERT (I am using Python/Spyder and SQL Server Management Studio), and it does insert the row, but towards the middle of the table rather than at the bottom. I have no idea why it does that. This is the code that I am using:
def create(conn):
    print("Create")
    cursor = conn.cursor()
    cursor.execute("insert into PricesTest "
                   "(Price1,Price2,Price3,Price4,Price5,Price6,Price7,Price8,Price9,Price10,Price11,Price12) "
                   "values (?,?,?,?,?,?,?,?,?,?,?,?);",
                   (46,44,44,44,44,44,44,44,44,44,44,44))
    conn.commit()
    read(conn)
Any idea why this is happening? What should I add to my code to "force" that row to be added at the bottom of the table? Many thanks.
I managed to sort it out following the different suggestions posted here. Basically I was conceptually wrong to think that tables in MS SQL Server have an order. I am now working with the data in my table using ORDER BY on dates (I added dates as my first column) and it works well. Many thanks all for your help!
The fact is that new rows are inserted without any particular order by default, because the server has no rule to order newly inserted rows (there is no primary key defined). You should have created an identity column before importing your data (you can even do it now):
Id Int IDENTITY(1,1) primary key
This will ensure all rows will be added at the end of the table.
More info on the data types you can use is on W3Schools: https://www.w3schools.com/sql/sql_datatypes.asp
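A sketch of adding that identity column after the fact, assuming the conn object from the question; new inserts then receive increasing Id values, so ORDER BY Id keeps them at the bottom:
cursor = conn.cursor()
# Add the auto-incrementing key to the existing table (existing rows get Ids assigned too).
cursor.execute("ALTER TABLE PricesTest ADD Id INT IDENTITY(1,1) PRIMARY KEY")
conn.commit()
# Read the rows back ordered by the new key.
cursor.execute("SELECT * FROM PricesTest ORDER BY Id")
rows = cursor.fetchall()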

SQLite3 Delete row using auto ID using Python

I'm using Python, Pandas and SQLite3. I've read data from a CSV into a Dataframe and then used to_sql to store the data from Pandas in a SQLite3 database table.
I didn't set an index so SQLite3 used an auto index for each row.
Is it possible for me to use that auto index when I want to delete a row from my Python code?
I want to use something like "DELETE FROM prod_backup WHERE id=4;" but I'm getting an error saying there is no column called id. When I use a database browser, the auto index column doesn't have a name, so I'm not sure how to reference it.
Any help would be greatly appreciated. Thanks
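In case it helps, a minimal sketch based on SQLite's implicit rowid, which is the hidden auto index an ordinary table keeps when you don't define your own key; the database file name below is hypothetical:
import sqlite3

conn = sqlite3.connect('prod.db')  # hypothetical database file
# rowid (also reachable as _rowid_ or oid) is the unnamed auto index shown in DB browsers.
conn.execute('DELETE FROM prod_backup WHERE rowid = ?', (4,))
conn.commit()
conn.close()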

Insert into Vertica if table does not exist or not a duplicate row

I have written a Python script that creates a table using a CREATE TABLE IF NOT EXISTS statement and then inserts rows from a DataFrame into a Vertica database. The first time I run the script, I want it to create the table and insert the data; this works fine.
But from the next run onwards, I want it to create the table only if it does not exist (works fine) and to insert a row only if that row is not already in the database.
I use both an INSERT statement and a COPY statement to insert data. How can I do this in Python? I am accessing the Vertica database from Python using pyodbc.
Editing the post to include some code :
There is a dataframe called tableframe_df, from which I need to populate content into a table created as below:
I am creating a table in vertica with create table if not exists, which creates a table if there is not one.
cursor.execute("create table if not exists <tablename> (fields in the table)")
Then a COPY statement to write to this table from a CSV that was created:
cursor.execute("COPY tablename1 FROM LOCAL 'tablename.csv' DELIMITER ',' exceptions 'exceptions' rejected data 'rejected'")
for i, row in tablename_df.iterrows():
    cursor.execute("insert into tablename2 values(?,?,?,?,?,?,?,?,?,?,?,?)", row.values[0], row.values[1], row.values[2], row.values[3], row.values[4], row.values[5], row.values[6], row.values[7], row.values[8], row.values[9], row.values[10], row.values[11])
In the above code, I am creating the table and then inserting into tablename1 and tablename2 using COPY and INSERT. This works fine the first time it is executed (as there is no data in the table yet). But if by mistake I run the same script twice, the data will be inserted twice into these tables. What check should I perform to ensure that data does not get inserted if it is already present?
First I'll mention that INSERT VALUES is pretty slow if you are doing a lot of rows. If you are using batch SQL and the standard Vertica drivers, it should convert it to a COPY, but if it doesn't then your inserts might take forever. I don't think this conversion will happen with pyodbc since it doesn't implement executemany() optimally. You might be able to get it with ceODBC, but I haven't tried it. Alternatively, you can use vertica_python, which has an efficient .copy('COPY ... FROM STDIN ...', data) method.
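For reference, a hedged sketch of that vertica_python copy route (connection settings and the table/CSV names are placeholders, not values from the question):
import vertica_python

conn_info = {'host': 'vertica-host', 'port': 5433, 'user': 'dbuser',
             'password': '***', 'database': 'mydb'}  # placeholder credentials
with vertica_python.connect(**conn_info) as connection:
    cur = connection.cursor()
    with open('tablename.csv', 'rb') as f:
        # Stream the file straight into a COPY ... FROM STDIN
        cur.copy("COPY tablename1 FROM STDIN DELIMITER ','", f)
    connection.commit()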
Anyhow, for your question...
You can do it one of two ways. Also, for the inserts, I would really try to change this to a COPY or at least an executemany. Again, pyodbc does not do this properly, at least in the releases I have used.
Use a control table that somehow uniquely describes the set of data being loaded: insert into it, and check it before running the script to confirm the data set has not already been loaded.
--Step 1. Check control table for data set load.
SELECT *
FROM mycontroltable
WHERE dataset = ?
--Step 2. If row not found, insert rows
for row in data:
    cursor.execute('INSERT INTO mytargettable....VALUES(...)')
-- Step 3. Insert row into control table
INSERT INTO mycontroltable( dataset ) VALUES ( ? )
-- Step 4. Commit data
COMMIT
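A rough pyodbc wiring of this control-table approach; mycontroltable, the dataset label and the three-column target insert are placeholders rather than names from the question:
dataset_id = 'tablename_2016-01-01'  # hypothetical label identifying this load
cursor.execute("SELECT 1 FROM mycontroltable WHERE dataset = ?", dataset_id)
if cursor.fetchone() is None:
    for row in tablename_df.itertuples(index=False):
        cursor.execute("INSERT INTO mytargettable VALUES (?,?,?)", tuple(row))  # match the ? count to your columns
    cursor.execute("INSERT INTO mycontroltable (dataset) VALUES (?)", dataset_id)
    cursor.commit()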
Alternatively you can insert or merge data in based on a key. You can create a temp or other staging table to do it. If you don't want updates and data does not change once inserted, then INSERT will be better as it will not incur a delete vector. I'll do INSERT based on the way you phrased your question.
--Step 1. Create local temp for intermediate target
CREATE LOCAL TEMP TABLE mytemp (fields) ON COMMIT DELETE ROWS;
--Step 2. Insert data.
for row in data:
    cursor.execute('INSERT INTO mytemp....VALUES(...)')
--Step 3. Insert/select only data that doesn't exist by key value
INSERT INTO mytargettable (fields)
SELECT fields
FROM mytemp
WHERE NOT EXISTS (
SELECT 1
FROM mytargettable t
WHERE t.key = mytemp.key
)
--Step 4. Commit
COMMIT;
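And a rough pyodbc wiring of this second approach; again the temp-table columns and the key name are placeholders, and it assumes autocommit is off (pyodbc's default) so the staged rows survive until the final commit:
# Stage everything, then insert only rows whose key isn't already in the target.
cursor.execute("CREATE LOCAL TEMP TABLE mytemp (key_col INT, val VARCHAR(100)) ON COMMIT DELETE ROWS")
cursor.executemany("INSERT INTO mytemp VALUES (?, ?)",
                   [tuple(r) for r in tablename_df.itertuples(index=False)])
cursor.execute("""
    INSERT INTO mytargettable (key_col, val)
    SELECT key_col, val
    FROM mytemp
    WHERE NOT EXISTS (SELECT 1 FROM mytargettable t WHERE t.key_col = mytemp.key_col)
""")
cursor.commit()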
