Pandas to_sql doesn't insert any data in my table - python

I am trying to insert some data in a table I have created.
I have a data frame that looks like this:
I created a table:
create table online.ds_attribution_probabilities
(
attribution_type text,
channel text,
date date ,
value float
)
And I am running this Python script:
engine = create_engine("postgresql://#e.eu-central-1.redshift.amazonaws.com:5439/mdhclient_encoding=utf8")
connection = engine.raw_connection()
result.to_sql('online.ds_attribution_probabilities', con=engine, index = False, if_exists = 'append')
I get no error, but when I check, there is no data in my table. What could be wrong? Do I have to commit or do an extra step?

Try to specify a schema name:
result.to_sql('ds_attribution_probabilities', con=engine,
              schema='online', index=False, if_exists='append')

Hopefully this helps someone else. to_sql will fail silently, in the form of what looks like a successful insert, if you pass a connection object. This is definitely true for Postgres, but I assume the same holds for others as well, based on the method docs:
con : sqlalchemy.engine.Engine or sqlite3.Connection
Using SQLAlchemy makes it possible to use any DB supported by that
library. Legacy support is provided for sqlite3.Connection objects.
This got me because the type hints stated Union[Engine, Connection], which is "technically" true.
If you have a session with SQLAlchemy, try passing con=session.get_bind().
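For example, here is a minimal sketch of the difference (the connection string and dataframe below are placeholders, not from the original post):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
import pandas as pd

engine = create_engine("postgresql://user:password@host:5439/dbname")  # placeholder URL
df = pd.DataFrame({"channel": ["email"], "value": [0.5]})               # placeholder data

# Passing the Engine lets pandas open, commit and close the transaction itself
df.to_sql("ds_attribution_probabilities", con=engine,
          schema="online", index=False, if_exists="append")

# If you already work with an ORM session, hand pandas its bind instead
Session = sessionmaker(bind=engine)
session = Session()
df.to_sql("ds_attribution_probabilities", con=session.get_bind(),
          schema="online", index=False, if_exists="append")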

I had a similar issue, caused by the fact that I was passing a SQLAlchemy connection object instead of an engine object to the con parameter. In my case the tables were created but left empty.

In my case, writing data to the database was hampered by the fast_executemany option.
Why this fast-loading option interferes, I have not yet figured out.
This code doesn't work:
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect={}".format(db_params), fast_executemany=True)
df.to_sql('tablename', engine, index=False, schema='dbo', if_exists='replace')
Without fast_executemany=True the code works well.

Check the autocommit setting: https://docs.sqlalchemy.org/en/latest/core/connections.html#understanding-autocommit
from sqlalchemy import text
engine.execute(text("SELECT my_mutating_procedure()").execution_options(autocommit=True))
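If you would rather not rely on autocommit detection at all, a sketch of an alternative (assuming SQLAlchemy 1.4+, where engine.begin() opens a transaction that commits when the block exits) is:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@host:5432/dbname")  # placeholder URL

with engine.begin() as conn:
    # the transaction is committed automatically when the block exits without error
    conn.execute(text("SELECT my_mutating_procedure()"))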

This could happen because to_sql defaults to the public schema, and there's probably a table with that name under the public schema, with your data in it.
@MaxU's answer helps in some cases, but not in others.
For the other cases, here is something else you can try:
When you create the engine, specify the schema name like this:
engine = create_engine(<connection_string>,
                       connect_args={'options': '-csearch_path={}'.format(<dbschema_name>)})
Link: https://stackoverflow.com/a/49930672/8656608
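As a concrete sketch (the connection string below is a placeholder; the schema name comes from the question above):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://user:password@host:5439/dbname",        # placeholder URL
    connect_args={'options': '-csearch_path=online'},     # make 'online' the default schema
)
result.to_sql('ds_attribution_probabilities', con=engine,
              index=False, if_exists='append')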

I faced the same problem when I used .connect() and .begin()
with engine.connect() as conn, conn.begin():
    dataframe.to_sql(name='table_name', schema='schema',
                     con=conn, if_exists='append', index=False)
    conn.close()
Just remove the .connect() and .begin() and it will work.
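In other words, a sketch of the working variant is simply:

# passing the engine instead of an open connection lets pandas
# manage (and commit) the transaction itself
dataframe.to_sql(name='table_name', schema='schema',
                 con=engine, if_exists='append', index=False)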

Related

How to do SQL Server transactions with Python through SQLAlchemy without blocking access to db tables in SQL Server Management Studio?

I've been stuck for a while now on this question and wasn't able to find the right answer/topic on the internet.
Basically, I have a table on SQL Server that I try to 'replace' with an updated one held in a pandas dataframe. I need to do this as a transaction, so that my original table isn't lost if something goes wrong while transferring data from the dataframe (rollback functionality). I found a solution for this: the SQLAlchemy library.
My code for this:
engine = create_engine("mssql+pyodbc://server_name:password#user/database?driver=SQL+Server+Native+Client+11.0")
with engine.begin() as conn:
    df.to_sql(name='table_name', schema='db_schema', con=conn, if_exists='replace', index=False)
The problem occurs when I try to access the tables in this specific database through SQL Server Management Studio 18 while the transaction is running: it somehow manages to block the whole database, and no one can access any tables in it (access time limits exceeded). The code above works fine; I've tried transferring only a small chunk of the dataframe, but the problem still persists, and I need to transfer a large dataframe.
What I've tried:
The concept of isolation levels, but this isn't the right thing, as isolation levels are about the rules for connecting to a table that is already in use.
Example:
engine = create_engine("mssql+pyodbc://server_name:password#user/database?driver=SQL+Server+Native+Client+11.0", isolation_level="SERIALIZABLE")
Adjusting parameters such as pool_size and max_overflow in the create_engine() call, and chunksize in the df.to_sql() call, but they don't seem to have any effect. Example:
engine = create_engine("mssql+pyodbc://server_name:password#user/database?driver=SQL+Server+Native+Client+11.0", pool_size = 1, max_overflow = 0)
with engine.begin() as conn: df.to_sql(name = 'table_name', schema = 'db_schema', con = conn, if_exists = 'replace', chunksize = 1, index = False)
Excluding the schema parameter from the df.to_sql() call doesn't work either.
Basic SQL code and functionality I'm trying to achieve for this task would look something like this:
BEGIN TRANSACTION
BEGIN TRY
    DELETE FROM [db].[schema].[table];
    INSERT INTO [db].[schema].[table] <--- dataframe
    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION
    SELECT ERROR_NUMBER() AS [Error_number], ERROR_MESSAGE() AS [Error_description]
END CATCH
I could create another buffer table, parse the df data into it, and run a transaction from that table afterwards, but I'm looking for a solution that bypasses these extra steps.
If there is a better way to do this task please let me know as well.
As suggested by @GordThompson, the right solution, given that your db table already exists, is as follows:
engine = create_engine("mssql+pyodbc://server_name:password#user/database?driver=SQL+Server+Native+Client+11.0")
# start transaction
with engine.begin() as conn:
    # clean the table
    conn.exec_driver_sql("TRUNCATE TABLE [db].[schema].[table]")
    # append data from df
    df.to_sql(name='table_name', schema='schema_name', con=conn, if_exists='append', index=False)

Example of using the 'callable' method in pandas.to_sql()?

I'm trying to make a specific insert statement that has an ON CONFLICT argument (I'm uploading to a Postgres database); will the df.to_sql(method='callable') allow that? Or is it intended for another purpose? I've read through the documentation, but I wasn't able to grasp the concept. I looked around on this website and others for similar questions, but I haven't found one yet. If possible I would love to see an example of how to use the 'callable' method in practice. Any other ideas on how to effectively load large numbers of rows from pandas using ON CONFLICT logic would be much appreciated as well. Thanks in advance for the help!
Here's an example of how to use Postgres's ON CONFLICT DO NOTHING with to_sql:
# import the Postgres-specific insert construct
from sqlalchemy.dialects.postgresql import insert

def to_sql_on_conflict_do_nothing(pd_table, conn, keys, data_iter):
    # This is very similar to the default to_sql insert in pandas;
    # only the conn.execute line is changed
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(insert(pd_table.table).on_conflict_do_nothing(), data)
conn = engine.connect()
df.to_sql("some_table", conn, if_exists="append", index=False, method=to_sql_on_conflict_do_nothing)
I have just had a similar problem, and following this answer I came up with a solution for how to send a df to PostgreSQL with ON CONFLICT:
1. Send some initial data to the database to create the table
from sqlalchemy import create_engine
engine = create_engine(connection_string)
df.to_sql(table_name,engine)
2. Add a primary key:
ALTER TABLE table_name ADD COLUMN id SERIAL PRIMARY KEY;
3. Create a unique index on the column (or columns) whose uniqueness you want to check:
CREATE UNIQUE INDEX review_id ON test(review_id);
4. Map the SQL table with SQLAlchemy:
from sqlalchemy.ext.automap import automap_base
ABase = automap_base()
ABase.prepare(engine, reflect=True)  # reflect the existing tables so the mapped class is available
Table = ABase.classes.table_name
Table.__tablename__ = 'table_name'
5. Do your insert on conflict with:
from sqlalchemy.dialects.postgresql import insert
insrt_vals = df.to_dict(orient='records')
insrt_stmnt = insert(Table).values(insrt_vals)
do_nothing_stmt = insrt_stmnt.on_conflict_do_nothing(index_elements=['review_id'])
results = engine.execute(do_nothing_stmt)

How to prevent aliasing in SQLAlchemy Query

I am using SQLAlchemy to extract data from a SQL Server DB into a Pandas Dataframe:
q: Query = self._session(db).query(tbl_obj)
return pd.read_sql(
    str(q),
    db.conn()
)
tbl_obj is a SQLAlchemy Table object that has been autoloaded from an existing table in the DB.
My problem is that the query that's being created automatically aliases the column names to 'TABLE_NAME_COLUMN_NAME', when I just want them to be 'COLUMN_NAME'.
I figure this is a fairly simple solution, but I haven't figured it out yet. Any thoughts?
Posting this because I figured it out while I was typing up the question. The problem was that I was calling str(q) when I should have been calling q.statement
This code works as expected, because the 'statement' attribute doesn't include the aliasing:
q: Query = self._session(db).query(tbl_obj)
return pd.read_sql(
    q.statement,
    db.conn()
)

Insert into postgreSQL table from pandas with "on conflict" update

I have a pandas DataFrame that I need to store into the database. Here's my current line of code for inserting:
df.to_sql(table,con=engine,if_exists='append',index_label=index_col)
This works fine if none of the rows in df exist in my table. If a row already exists, I get this error:
sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) duplicate key
value violates unique constraint "mypk"
DETAIL: Key (id)=(42) already exists.
[SQL: 'INSERT INTO mytable (id, owner,...) VALUES (%(id)s, %(owner)s,...']
[parameters:...] (Background on this error at: http://sqlalche.me/e/gkpj)
and nothing is inserted.
PostgreSQL has an optional ON CONFLICT clause, which could be used to UPDATE the existing table rows. I read the entire pandas.DataFrame.to_sql manual page and I couldn't find any way to use ON CONFLICT within the DataFrame.to_sql() function.
I have considered splitting my DataFrame in two based on what's already in the db table. So now I have two DataFrames, insert_rows and update_rows, and I can safely execute
insert_rows.to_sql(table, con=engine, if_exists='append', index_label=index_col)
But then, there seems to be no UPDATE equivalent to DataFrame.to_sql(). So how do I update the table using DataFrame update_rows?
I know it's an old thread, but I ran into the same issue and this thread showed up in Google. None of the answers is really satisfying yet, so here's what I came up with:
My solution is pretty similar to zdgriffith's answer, but much more performant as there's no need to iterate over data_iter:
def postgres_upsert(table, conn, keys, data_iter):
    from sqlalchemy.dialects.postgresql import insert

    data = [dict(zip(keys, row)) for row in data_iter]
    insert_statement = insert(table.table).values(data)
    upsert_statement = insert_statement.on_conflict_do_update(
        constraint=f"{table.table.name}_pkey",
        set_={c.key: c for c in insert_statement.excluded},
    )
    conn.execute(upsert_statement)
Now you can use this custom upsert method in pandas' to_sql method like zdgriffith showed.
Please note that my upsert function uses the primary key constraint of the table. You can target another constraint by changing the constraint argument of .on_conflict_do_update.
This SO answer on a related thread explains the use of .excluded a bit more: https://stackoverflow.com/a/51935542/7066758
@SaturnFromTitan, thanks for the reply to this old thread. That worked like magic. I would upvote, but I don't have the rep.
For those that are as new to all this as I am:
You can cut and paste SaturnFromTitan's answer and call it with something like:
df.to_sql('my_table_name',
          dbConnection, schema='my_schema',
          if_exists='append',
          index=False,
          method=postgres_upsert)
And that's it. The upsert works.
To follow up on Brendan's answer with an example, this is what worked for me:
import os
import sqlalchemy as sa
import pandas as pd
from sqlalchemy.dialects.postgresql import insert
engine = sa.create_engine(os.getenv("DBURL"))
meta = sa.MetaData()
meta.bind = engine
meta.reflect(views=True)
def upsert(table, conn, keys, data_iter):
    upsert_args = {"constraint": "test_table_col_a_col_b_key"}
    for data in data_iter:
        data = {k: data[i] for i, k in enumerate(keys)}
        upsert_args["set_"] = data
        insert_stmt = insert(meta.tables[table.name]).values(**data)
        upsert_stmt = insert_stmt.on_conflict_do_update(**upsert_args)
        conn.execute(upsert_stmt)

if __name__ == "__main__":
    df = pd.read_csv("test_data.txt")
    with engine.connect() as conn:
        df.to_sql(
            "test_table",
            con=conn,
            if_exists="append",
            method=upsert,
            index=False,
        )
where in this example the schema would be something like:
CREATE TABLE test_table (
    col_a text NOT NULL,
    col_b text NOT NULL,
    col_c text,
    UNIQUE (col_a, col_b)
)
If you notice in the to_sql docs there's mention of a method argument that takes a callable. Creating this callable should allow you to use the Postgres clauses you need. Here's an example of a callable they mentioned in the docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method
It's pretty different from what you need, but follow the arguments passed to this callable. They will allow you to construct a regular SQL statement.
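For reference, the callable pandas passes to method receives the pandas table wrapper, the database connection, the column names, and an iterator of row tuples. A bare-bones skeleton (a sketch only, without any ON CONFLICT logic; df, engine and "mytable" are placeholders) looks like this:

def insert_method_skeleton(pd_table, conn, keys, data_iter):
    # keys is the list of column names; data_iter yields one tuple per row
    rows = [dict(zip(keys, row)) for row in data_iter]
    # pd_table.table is the underlying SQLAlchemy Table object, so you can
    # build whatever statement you need against it and execute it on conn
    conn.execute(pd_table.table.insert(), rows)

df.to_sql("mytable", engine, if_exists="append", index=False,
          method=insert_method_skeleton)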
If anybody wants to build on top of zdgriffith's answer and dynamically generate the table constraint name, you can use the following query for PostgreSQL:
select distinct tco.constraint_name
from information_schema.table_constraints tco
join information_schema.key_column_usage kcu
  on kcu.constraint_name = tco.constraint_name
  and kcu.constraint_schema = tco.constraint_schema
where kcu.table_name = '{table.name}'
  and constraint_type = 'PRIMARY KEY';
You can then format this string to populate table.name inside the upsert() method.
I also didn't require the meta.bind and meta.reflect() lines. The latter will be deprecated soon anyway.
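A sketch of how that lookup could be wired into the upsert callable (it reuses the query above as a bound-parameter statement; any names not appearing in the answers here are assumptions):

from sqlalchemy import text
from sqlalchemy.dialects.postgresql import insert

def postgres_upsert_dynamic(table, conn, keys, data_iter):
    # look up the primary key constraint name for this table at run time
    pk_query = text("""
        select distinct tco.constraint_name
        from information_schema.table_constraints tco
        join information_schema.key_column_usage kcu
          on kcu.constraint_name = tco.constraint_name
          and kcu.constraint_schema = tco.constraint_schema
        where kcu.table_name = :table_name
          and constraint_type = 'PRIMARY KEY'
    """)
    constraint_name = conn.execute(pk_query, {"table_name": table.table.name}).scalar()

    data = [dict(zip(keys, row)) for row in data_iter]
    insert_statement = insert(table.table).values(data)
    upsert_statement = insert_statement.on_conflict_do_update(
        constraint=constraint_name,
        set_={c.key: c for c in insert_statement.excluded},
    )
    conn.execute(upsert_statement)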

How do I drop a table in SQLAlchemy when I don't have a table object?

I want to drop a table (if it exists) before writing some data in a Pandas dataframe:
def store_sqlite(in_data, dbpath='my.db', table='mytab'):
    database = sqlalchemy.create_engine('sqlite:///' + dbpath)
    ## DROP TABLE HERE
    in_data.to_sql(name=table, con=database, if_exists='append')
    database.close()
The SQLAlchemy documentation all points to a Table.drop() object - how would I create that object, or equivalently is there an alternative way to drop this table?
Note: I can't just use if_exists='replace', as the input data is actually a dict of DataFrames which I loop over - I've suppressed that code for clarity (I hope).
From the pandas docs:
"You can also run a plain query without creating a dataframe with execute(). This is useful for queries that don’t return values, such as INSERT. This is functionally equivalent to calling execute on the SQLAlchemy engine or db connection object."
http://pandas.pydata.org/pandas-docs/version/0.18.0/io.html#id3
So I do this;
from pandas.io import sql
sql.execute('DROP TABLE IF EXISTS %s'%table, engine)
sql.execute('VACUUM', engine)
Where "engine" is the SQLAlchemy database object (the OP's "database" above). Vacuum is optional, just reduces the size of the sqlite file (I use the table drop part infrequently in my code).
You should be able to create a cursor from your SQLAlchemy engine
import sqlalchemy
engine = sqlalchemy.create_engine('sqlite:///' + dbpath)
connection = engine.raw_connection()
cursor = connection.cursor()
command = "DROP TABLE IF EXISTS {};".format(table)
cursor.execute(command)
connection.commit()
cursor.close()
# Now you can chunk upload your data as you wish
in_data.to_sql(name=table, con=engine, if_exists='append')
If you're loading a lot of data into your db, you may find it faster to use pandas' to_csv() together with PostgreSQL's COPY via psycopg2's copy_from function. You can also use StringIO() to hold the data in memory and avoid having to write a file at all.
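A rough sketch of that approach (assuming a psycopg2-backed engine, and that the dataframe's column order matches the target table):

import io

def copy_from_dataframe(df, engine, table_name):
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)  # write CSV into memory, no file on disk
    buf.seek(0)
    raw = engine.raw_connection()              # underlying psycopg2 connection
    try:
        with raw.cursor() as cur:
            # note: copy_from does no CSV quoting, so this assumes the data
            # contains no embedded separators or newlines
            cur.copy_from(buf, table_name, sep=",", null="")
        raw.commit()
    finally:
        raw.close()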
