I am using SQLAlchemy to extract data from a SQL Server DB into a pandas DataFrame:
q: Query = self._session(db).query(tbl_obj)
return pd.read_sql(
    str(q),
    db.conn()
)
tbl_obj is a SQLAlchemy Table object that has been autoloaded from an existing table in the DB.
My problem is that the query that's being created automatically aliases the column names to 'TABLE_NAME_COLUMN_NAME', when I just want them to be 'COLUMN_NAME'.
I figure this is a fairly simple solution, but I haven't figured it out yet. Any thoughts?
Posting this because I figured it out while I was typing up the question. The problem was that I was calling str(q) when I should have been calling q.statement.
This code works as expected, because the 'statement' attribute doesn't include the aliasing:
q: Query = self._session(db).query(tbl_obj)
return pd.read_sql(
    q.statement,
    db.conn()
)
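For reference, a minimal sketch of an equivalent approach that skips the Query object entirely, assuming tbl_obj and db.conn() behave as in the question (select([...]) is the 1.x calling style):
import pandas as pd
from sqlalchemy import select

# a plain SELECT over the reflected table keeps the original column names
stmt = select([tbl_obj])
df = pd.read_sql(stmt, db.conn())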
I need to select some columns from a table with SQLAlchemy. Everything works fine except selecting the one column with '/' in the name. My query looks like:
query = select([func.sum(Table.c.ColumnName),
                func.sum(Table.c.Column/Name),
                ])
Obviously the issue comes from the second line with the column 'Column/Name'. Is there a way in SQLAlchemy to handle special characters in a column name?
edit:
I have it all inside a class, but a simplified version of the process looks like this. I create an engine (all the necessary DB details are inside the create_new_engine() function) and map all tables in the DB into metadata.
def map(self):
    from sqlalchemy.engine.base import Engine
    # check if the engine exists
    if not isinstance(self.engine, Engine):
        self.create_new_engine()
    self.metadata = MetaData({'schema': 'dbo'})
    self.metadata.reflect(bind=self.engine)
Then I map a single table with:
def map_table(self, table_name):
    table = "{schema}.{table_name}".format(schema=self.metadata.schema, table_name=table_name)
    table = self.metadata.tables[table]
    return table
In the end I use pandas read_sql_query to run the above query with the connection and engine established earlier.
I'm connecting to SQL Server.
Since Table.c points to a plain Python object, a name containing '/' can't be written as attribute access. Try getattr in pure Python:
query = select([func.sum(Table.c.ColumnName),
                func.sum(getattr(Table.c, 'Column/Name')),
                ])
So in your case (from the comments above):
func.sum(getattr(Table.c, 'cur/fees'))
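A minimal end-to-end sketch under the same names, assuming the engine and the reflected Table from map_table() above, plus the 'cur/fees' column mentioned in the comments:
import pandas as pd
from sqlalchemy import select, func

# use getattr for the column whose name contains '/'
query = select([func.sum(Table.c.ColumnName),
                func.sum(getattr(Table.c, 'cur/fees'))])
df = pd.read_sql_query(query, con=engine)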
I'm trying to dump a pandas DataFrame into an existing Snowflake table (via a Jupyter notebook).
When I run the code below no errors are raised, but no data is written to the destination SF table (df has ~800 rows).
import os

from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

sf_engine = create_engine(
    URL(
        user=os.environ['SF_PROD_EID'],
        password=os.environ['SF_PROD_PWD'],
        account=account,
        warehouse=warehouse,
        database=database,
    )
)
df.to_sql(
    "test_table",
    con=sf_engine,
    schema=schema,
    if_exists="append",
    index=False,
    chunksize=16000,
)
If I check the SF History, I can see that the queries apparently ran without issue.
If I pull the query from the SF History UI and run it manually in the Snowflake UI the data shows up in the destination table.
If I try to use locopy I run into the same issue.
If the table does not exist beforehand, the same code above creates the table and inserts the rows without a problem.
Here's where it gets weird. When I run the df.to_sql command to try to append and then drop the destination table, if I then issue a select count(*) from destination_table, a table with that name still exists and holds (only) the data I've been trying to load. Could this be a case-sensitive table naming situation?
Any insight is appreciated :)
Try adding role="<role>" and schema="<schema>" to the URL:
engine = create_engine(URL(
    account=os.getenv("SNOWFLAKE_ACCOUNT"),
    user=os.getenv("SNOWFLAKE_USER"),
    password=os.getenv("SNOWFLAKE_PASSWORD"),
    role="<role>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>"
))
The issue was due to how I set up the database connection and the case-sensitivity of the table name. It turns out I was writing to a table called DB.SCHEMA."db.schema.test_table" (note that the db.schema slug becomes part of the quoted table name). Don't be like me, kids: use upper-case table names in Snowflake!
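For anyone hitting the same trap, a minimal sketch of the corrected pattern, assuming the same variables as in the snippet above (account, warehouse, database, schema, df); the schema goes into the URL and only a bare table name is passed to to_sql:
import os
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

sf_engine = create_engine(
    URL(
        user=os.environ['SF_PROD_EID'],
        password=os.environ['SF_PROD_PWD'],
        account=account,
        warehouse=warehouse,
        database=database,
        schema=schema,  # schema lives in the URL, not in the table name
    )
)

# bare table name only; the earlier bug came from baking db.schema into the name
df.to_sql("test_table", con=sf_engine, if_exists="append", index=False)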
I'm trying to make a specific insert statement that has an ON CONFLICT clause (I'm uploading to a Postgres database); will df.to_sql(method='callable') allow that? Or is it intended for another purpose? I've read through the documentation, but I wasn't able to grasp the concept. I looked around on this website and others for similar questions, but I haven't found one yet. If possible I would love to see an example of how to use the 'callable' method in practice. Any other ideas on how to effectively load large numbers of rows from pandas using ON CONFLICT logic would be much appreciated as well. Thanks in advance for the help!
Here's an example of how to use Postgres's ON CONFLICT DO NOTHING with to_sql:
# import postgres specific insert
from sqlalchemy.dialects.postgresql import insert

def to_sql_on_conflict_do_nothing(pd_table, conn, keys, data_iter):
    # This is very similar to the default to_sql function in pandas
    # Only the conn.execute line is changed
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(insert(pd_table.table).on_conflict_do_nothing(), data)
conn = engine.connect()
df.to_sql("some_table", conn, if_exists="append", index=False, method=to_sql_on_conflict_do_nothing)
I have just had a similar problem, and following this answer I came up with a solution for how to send a df to PostgreSQL with ON CONFLICT:
1. Send some initial data to the database to create the table
from sqlalchemy import create_engine
engine = create_engine(connection_string)
df.to_sql(table_name, engine)
2. Add a primary key:
ALTER TABLE table_name ADD COLUMN id SERIAL PRIMARY KEY;
3. Prepare an index on the column (or columns) you want to check for uniqueness:
CREATE UNIQUE INDEX review_id ON table_name(review_id);
4. Map the SQL table with SQLAlchemy:
from sqlalchemy.ext.automap import automap_base
ABase = automap_base()
ABase.prepare(engine, reflect=True)  # reflect the existing tables into mapped classes
Table = ABase.classes.table_name
Table.__tablename__ = 'table_name'
5. Do your insert on conflict with:
from sqlalchemy.dialects.postgresql import insert
insrt_vals = df.to_dict(orient='records')
insrt_stmnt = insert(Table).values(insrt_vals)
do_nothing_stmt = insrt_stmnt.on_conflict_do_nothing(index_elements=['review_id'])
results = engine.execute(do_nothing_stmt)
I have a pandas DataFrame that I need to store into the database. Here's my current line of code for inserting:
df.to_sql(table, con=engine, if_exists='append', index_label=index_col)
This works fine if none of the rows in df exist in my table. If a row already exists, I get this error:
sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) duplicate key
value violates unique constraint "mypk"
DETAIL: Key (id)=(42) already exists.
[SQL: 'INSERT INTO mytable (id, owner,...) VALUES (%(id)s, %(owner)s,...']
[parameters:...] (Background on this error at: http://sqlalche.me/e/gkpj)
and nothing is inserted.
PostgreSQL has an optional ON CONFLICT clause, which could be used to UPDATE the existing table rows. I read the entire pandas.DataFrame.to_sql manual page and I couldn't find any way to use ON CONFLICT within the DataFrame.to_sql() function.
I have considered splitting my DataFrame in two based on what's already in the db table. So now I have two DataFrames, insert_rows and update_rows, and I can safely execute
insert_rows.to_sql(table, con=engine, if_exists='append', index_label=index_col)
But then, there seems to be no UPDATE equivalent to DataFrame.to_sql(). So how do I update the table using DataFrame update_rows?
I know it's an old thread, but I ran into the same issue and this thread showed up in Google. None of the answers is really satisfying yet, so here's what I came up with:
My solution is pretty similar to zdgriffith's answer, but much more performant as there's no need to iterate over data_iter:
def postgres_upsert(table, conn, keys, data_iter):
    from sqlalchemy.dialects.postgresql import insert

    data = [dict(zip(keys, row)) for row in data_iter]
    insert_statement = insert(table.table).values(data)
    upsert_statement = insert_statement.on_conflict_do_update(
        constraint=f"{table.table.name}_pkey",
        set_={c.key: c for c in insert_statement.excluded},
    )
    conn.execute(upsert_statement)
Now you can use this custom upsert method in pandas' to_sql method like zdgriffith showed.
Please note that my upsert function uses the primary key constraint of the table. You can target another constraint by changing the constraint argument of .on_conflict_do_update.
This SO answer on a related thread explains the use of .excluded a bit more: https://stackoverflow.com/a/51935542/7066758
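For reference, a variant of the same callable that targets the conflict by column list instead of by constraint name; the function name and the 'id' column below are illustrative:
from sqlalchemy.dialects.postgresql import insert

def postgres_upsert_by_columns(table, conn, keys, data_iter):
    # same shape as postgres_upsert above, but uses index_elements
    data = [dict(zip(keys, row)) for row in data_iter]
    insert_statement = insert(table.table).values(data)
    upsert_statement = insert_statement.on_conflict_do_update(
        index_elements=['id'],  # replace with your unique/primary key columns
        set_={c.key: c for c in insert_statement.excluded},
    )
    conn.execute(upsert_statement)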
@SaturnFromTitan, thanks for the reply to this old thread. That worked like magic. I would upvote, but I don't have the rep.
For those that are as new to all this as I am:
You can cut and paste SaturnFromTitan's answer and call it with something like:
df.to_sql('my_table_name',
          dbConnection,
          schema='my_schema',
          if_exists='append',
          index=False,
          method=postgres_upsert)
And that's it. The upsert works.
To follow up on Brendan's answer with an example, this is what worked for me:
import os
import sqlalchemy as sa
import pandas as pd
from sqlalchemy.dialects.postgresql import insert

engine = sa.create_engine(os.getenv("DBURL"))
meta = sa.MetaData()
meta.bind = engine
meta.reflect(views=True)

def upsert(table, conn, keys, data_iter):
    upsert_args = {"constraint": "test_table_col_a_col_b_key"}
    for data in data_iter:
        data = {k: data[i] for i, k in enumerate(keys)}
        upsert_args["set_"] = data
        insert_stmt = insert(meta.tables[table.name]).values(**data)
        upsert_stmt = insert_stmt.on_conflict_do_update(**upsert_args)
        conn.execute(upsert_stmt)
if __name__ == "__main__":
    df = pd.read_csv("test_data.txt")
    with engine.connect() as conn:
        df.to_sql(
            "test_table",
            con=conn,
            if_exists="append",
            method=upsert,
            index=False,
        )
where in this example the schema would be something like:
CREATE TABLE test_table (
    col_a text NOT NULL,
    col_b text NOT NULL,
    col_c text,
    UNIQUE (col_a, col_b)
)
If you notice in the to_sql docs, there's mention of a method argument that takes a callable. Creating this callable should allow you to use the Postgres clauses you need. Here's an example of a callable they mention in the docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method
It's pretty different from what you need, but follow the arguments passed to this callable. They will allow you to construct a regular SQL statement.
If anybody wants to build on top of the answer from zdgriffith and dynamically generate the table constraint name, you can use the following query for PostgreSQL:
select distinct tco.constraint_name
from information_schema.table_constraints tco
join information_schema.key_column_usage kcu
  on kcu.constraint_name = tco.constraint_name
 and kcu.constraint_schema = tco.constraint_schema
where kcu.table_name = '{table.name}'
  and tco.constraint_type = 'PRIMARY KEY';
You can then format this string to populate table.name inside the upsert() method.
I also didn't require the meta.bind and meta.reflect() lines. The latter will be deprecated soon anyway.
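A sketch of how that lookup could be folded into the callable itself, using a bound parameter instead of string formatting; the function name is illustrative, and information_schema.table_constraints alone already carries the table name and constraint type:
from sqlalchemy import text
from sqlalchemy.dialects.postgresql import insert

def postgres_upsert_dynamic(table, conn, keys, data_iter):
    # look up the primary key constraint name at runtime instead of hard-coding it
    pk_query = text(
        "select constraint_name from information_schema.table_constraints "
        "where table_name = :table_name and constraint_type = 'PRIMARY KEY'"
    )
    constraint_name = conn.execute(pk_query, {"table_name": table.table.name}).scalar()

    data = [dict(zip(keys, row)) for row in data_iter]
    insert_statement = insert(table.table).values(data)
    upsert_statement = insert_statement.on_conflict_do_update(
        constraint=constraint_name,
        set_={c.key: c for c in insert_statement.excluded},
    )
    conn.execute(upsert_statement)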
I am trying to insert some data in a table I have created.
I have a data frame with attribution_type, channel, date and value columns.
I created a table:
create table online.ds_attribution_probabilities
(
    attribution_type text,
    channel text,
    date date,
    value float
)
And I am running this python script:
engine = create_engine("postgresql://#e.eu-central-1.redshift.amazonaws.com:5439/mdhclient_encoding=utf8")
connection = engine.raw_connection()
result.to_sql('online.ds_attribution_probabilities', con=engine, index=False, if_exists='append')
I get no error, but when I check there are no data in my table. What can be wrong? Do I have to commit or do an extra step?
Try to specify a schema name:
result.to_sql('ds_attribution_probabilities', con=engine,
              schema='online', index=False, if_exists='append')
Hopefully this helps someone else. to_sql will fail silently in the form of what looks like a successful insert if you pass a connection object. This is definitely true for Postgres, but I assume the same for others as well, based on the method docs:
con : sqlalchemy.engine.Engine or sqlite3.Connection
    Using SQLAlchemy makes it possible to use any DB supported by that
    library. Legacy support is provided for sqlite3.Connection objects.
This got me because the type hints stated Union[Engine, Connection], which is "technically" true.
If you have a SQLAlchemy session, try passing con=session.get_bind().
I had a similar issue caused by the fact that I was passing a SQLAlchemy Connection object instead of an Engine object to the con parameter. In my case the tables were created but left empty.
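A small sketch of the Engine-vs-Connection distinction described above; the connection string, table name and DataFrame are placeholders:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/dbname")
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# passing the Engine lets pandas open, commit and close its own connection
df.to_sql("my_table", con=engine, if_exists="append", index=False)

# passing engine.connect() (a Connection) can leave the insert inside an
# uncommitted transaction, which is what makes the write look successful
# while the table stays empty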
In my case, writing data to the database was hampered by the fast_executemany option. Why this fast-loading option interferes, I have not yet figured out.
This code doesn't work:
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect={}".format(db_params), fast_executemany=True)
df.to_sql('tablename', engine, index=False, schema='dbo', if_exists='replace')
Without fast_executemany=True the code works well.
Check the autocommit setting: https://docs.sqlalchemy.org/en/latest/core/connections.html#understanding-autocommit
engine.execute(text("SELECT my_mutating_procedure()").execution_options(autocommit=True))
This could happen because to_sql defaults to the public schema, and there's probably a table with that name under the public database/schema, with your data in it.
@MaxU's answer does help some, but not others.
For others, here is something else you can try:
When you create the engine, specify the schemaname like this:
engine = create_engine(<connection_string>,
                       connect_args={'options': '-csearch_path={}'.format(<dbschema_name>)})
Link: https://stackoverflow.com/a/49930672/8656608
I faced the same problem when I used .connect() and .begin():
with engine.connect() as conn, conn.begin():
    dataframe.to_sql(name='table_name', schema='schema',
                     con=conn, if_exists='append', index=False)
    conn.close()
Just remove the .connect() and .begin() and it will work.
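In other words, pass the engine itself as con, something like this sketch with the same names as above:
dataframe.to_sql(name='table_name', schema='schema',
                 con=engine, if_exists='append', index=False)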