Pandas 0.24 read_sql operational errors - python

I just upgraded to Pandas 0.24.0 from 0.23.4 (Python 2.7.12), and many of my pd.read_sql queries are breaking. It looks like something related to MySQL, but it's strange that these errors only occur after updating my pandas version. Any ideas what's going on?
Here's my MySQL table:
CREATE TABLE `xlations_topic_update_status` (
`run_ts` datetime DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Here's my query:
import pandas as pd
from sqlalchemy import create_engine
db_engine = create_engine('mysql+mysqldb://<><>/product_analytics', echo=False)
pd.read_sql('select max(run_ts) from product_analytics.xlations_topic_update_status', con=db_engine).values[0][0]
And here's the error:
OperationalError: (_mysql_exceptions.OperationalError) (1059, "Identifier name 'select max(run_ts) from product_analytics.xlations_topic_update_status;' is too long") [SQL: 'DESCRIBE `select max(run_ts) from product_analytics.xlations_topic_update_status;`']
I've also gotten this for other more complex queries, but won't post them here.

According to the documentation, the first argument is either a string (a table name) or a SQLAlchemy Selectable (a select or text object). In other words, pd.read_sql() is delegating to pd.read_sql_table() and treating the entire query string as a table identifier, which is why the error shows a DESCRIBE statement wrapping your whole query.
Wrap your query string in a text() construct first:
from sqlalchemy import text

stmt = text('select max(run_ts) from product_analytics.xlations_topic_update_status')
pd.read_sql(stmt, con=db_engine).values[0][0]
This way pd.read_sql() will delegate to pd.read_sql_query() instead. Another option is to call it directly.

Try using pd.read_sql_query(sql, con) instead of pd.read_sql(...).
So:
pd.read_sql_query('select max(run_ts) from product_analytics.xlations_topic_update_status', con=db_engine).values[0][0]

Passing a non-executable SQL object to pandas read_sql() method

I'm facing this new warning within some Python 3.9 code:
/usr/local/lib/python3.9/site-packages/pandas/io/sql.py:761: UserWarning:
pandas only support SQLAlchemy connectable(engine/connection) or
database string URI or sqlite3 DBAPI2 connection other DBAPI2
objects are not tested, please consider using SQLAlchemy
raised for this snippet:
import pandas as pd
from psycopg2 import sql

fields = ('object', 'category', 'number', 'mode')
query = sql.SQL("SELECT {} FROM categories;").format(
    sql.SQL(', ').join(map(sql.Identifier, fields))
)
df = pd.read_sql(
    sql=query,
    con=connector()  # custom function which returns db parameters as a psycopg2 connection object
)
It works like a charm for the moment, but according to the warning message, I'd like to switch to SQLAlchemy.
But by doing so:
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://', creator=connector)
df = pd.read_sql(
    sql=query,
    con=engine
)
it says:
sqlalchemy.exc.ObjectNotExecutableError: Not an executable object:
Composed([SQL('SELECT '), Composed([Identifier('object'), SQL(', '),
Identifier('category'), SQL(', '), Identifier('number'), SQL(', '),
Identifier('mode')]), SQL(' FROM categories;')])
So I have to tweak it this way to avoid this error:
engine = create_engine('postgresql+psycopg2://', creator=connector)
conn = connector()
curs = conn.cursor()
df = pd.read_sql(
    sql=query.as_string(conn),  # non-pythonic, isn't it?
    con=engine
)
I'm wondering what the benefit of using an SQLAlchemy engine with pandas is if I still have to "decode" the query string using a psycopg2 connection context. (In some specific cases, where the query string is a binary string, I have to "decode" it by applying .decode('UTF-8').)
How can I rewrite the DataFrame construction in a proper (i.e. the best) way by using an SQLAlchemy engine with pandas?
The pandas doc is not 100% clear for me:
Parameters
sql : str or SQLAlchemy Selectable (select or text object)
    SQL query to be executed or a table name.
Version info:
python: 3.9
pandas: 1.4.3
sqlalchemy: 1.4.35
psycopg2: 2.9.3 (dt dec pq3 ext lo64)
The query can be expressed in SQLAlchemy syntax like this:
import pandas as pd
import sqlalchemy as sa
fields = ('object', 'category', 'number', 'mode')
# Adjust engine configuration to match your environment.
engine = sa.create_engine('postgresql+psycopg2:///test')
metadata = sa.MetaData()
# Reflect the table from the database.
tbl = sa.Table('categories', metadata, autoload_with=engine)
# Get column objects for column names.
columns = [tbl.c[name] for name in fields]
query = sa.select(*columns)
df = pd.read_sql(sql=query, con=engine)
print(df)
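If reflecting the table is more than you need, a lighter-weight variant (a sketch, not from the original answer; it assumes the same hypothetical categories table and engine URL) builds the same SELECT with SQLAlchemy's table()/column() constructs, without querying the database first:
import pandas as pd
import sqlalchemy as sa

fields = ('object', 'category', 'number', 'mode')
engine = sa.create_engine('postgresql+psycopg2:///test')
# table()/column() build the statement without a reflection round-trip,
# at the cost of losing server-side type information.
tbl = sa.table('categories', *(sa.column(name) for name in fields))
query = sa.select(*tbl.columns)
df = pd.read_sql(sql=query, con=engine)
print(df)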

Can I insert a table name by using a query parameter?

I have a SQLAlchemy engine where I try to pass parameters via sqlalchemy.sql.text to protect against SQL injection.
The following code works, where I use variables for the condition and the condition's value.
from sqlalchemy import create_engine
from sqlalchemy.sql import text

db_engine = create_engine(...)
db_engine.execute(
    text('SELECT * FROM table_name WHERE :condition_1 = :condition_1_value'),
    condition_1="name",
    condition_1_value="John"
).fetchall()
However, when I try to use a variable for the table_name, it returns an error.
from sqlalchemy import create_engine
from sqlalchemy.sql import text

db_engine = create_engine(...)
db_engine.execute(
    text('SELECT * FROM :table_name WHERE :condition_1 = :condition_1_value'),
    table_name="table_1",
    condition_1="name",
    condition_1_value="John"
).fetchall()
Any ideas why this does not work?
EDIT:
I know that it has something to do with the table_name not being a string, but I am not sure how to do it in another way.
Any ideas why this does not work?
Query parameters are used to supply the values of things (usually column values), not the names of things (tables, columns, etc.). Every database I've seen works that way.
So, despite the ubiquitous advice that dynamic SQL is a "Bad Thing", there are certain cases where it is simply necessary. This is one of them.
table_name = "table_1" # NOTE: Do not use untrusted input here!
stmt = text(f'SELECT * FROM "{table_name}" …')
Also, check the results you get from trying to parameterize a column name. You may not be getting what you expect.
stmt = text("SELECT * FROM table_name WHERE :condition_1 = :condition_1_value")
db_engine.execute(stmt, dict(condition_1="name", condition_1_value="John"))
will not produce the equivalent of
SELECT * FROM table_name WHERE name = 'John'
It will render the equivalent of
SELECT * FROM table_name WHERE 'name' = 'John'
and will not throw an error, but it will also return no rows because 'name' = 'John' will never be true.
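If the table name ultimately comes from user input, a common defensive pattern is to validate it against a fixed allowlist before interpolating, while still binding the value normally. A minimal sketch (ALLOWED_TABLES and select_by_name are hypothetical names, not from the original answer):
from sqlalchemy.sql import text

# Hypothetical allowlist of the table names the application owns.
ALLOWED_TABLES = {"table_1", "table_2"}

def select_by_name(db_engine, table_name, value):
    # Reject anything not explicitly allowed before it reaches the SQL string.
    if table_name not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table_name!r}")
    # The validated name is interpolated; the value stays a real bind parameter.
    stmt = text(f'SELECT * FROM "{table_name}" WHERE name = :value')
    return db_engine.execute(stmt, dict(value=value)).fetchall()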

How can I drop this table using SQLAlchemy?

I am trying to drop a table called 'New'. I currently have the following code:
import pandas as pd
import sqlalchemy
sqlcon = sqlalchemy.create_engine('mssql://ABSECTDCS100TL/AdventureWorks?driver=ODBC+Driver+17+for+SQL+Server')
df = pd.read_sql_query('SELECT * FROM DimReseller', sqlcon)
df.to_sql('New',sqlcon,if_exists='append', index=False)
sqlalchemy.schema.New.drop(bind=None, checkfirst=False)
I am receiving the error:
AttributeError: module 'sqlalchemy.schema' has no attribute 'New'
Any ideas on what I'm missing here? Thanks.
You can reflect the table into a Table object and then call its drop method:
from sqlalchemy import Table, MetaData
tbl = Table('New', MetaData(), autoload_with=sqlcon)
tbl.drop(sqlcon, checkfirst=False)
If you want to delete the table using raw SQL, you can do this:
from sqlalchemy import text

with sqlcon.connect() as conn:
    # Follow the identifier quoting convention for your RDBMS
    # to avoid problems with mixed-case names.
    conn.execute(text("""DROP TABLE "New" """))
    # Commit if necessary.
    conn.commit()

DataFrame.to_sql equivalent of using "RETURNING id" with psycopg2?

When inserting data using psycopg2 I can retrieve the id of the inserted row using a RETURNING PostgreSQL statement:
import psycopg2

conn = my_connection_parameters()
curs = conn.cursor()
sql_insert_data_query = (
    """INSERT INTO public.data
       (created_by, comment)
       VALUES ( %(user)s, %(comment)s )
       RETURNING id; -- the id is automatically managed by the database
    """
)
curs.execute(
    sql_insert_data_query,
    {
        "user": 'me',
        "comment": 'my comment'
    }
)
conn.commit()
data_id = curs.fetchone()[0]
and that's great because I need this id to write other data, e.g. to an associative table.
But when I have a large dictionary to write to PostgreSQL (whose keys are the column identifiers), it's more convenient to rely on pandas' DataFrame.to_sql() method:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://', creator=my_connection_parameters)
df = pd.DataFrame(my_dict, index=[0])  # a one-row DataFrame, each column created from the dict keys
df.to_sql(
    name='user_table',
    con=engine,
    schema='public',
    if_exists='append',
    index=False
)
but there is no direct way to retrieve the id PostgreSQL has created when this record was actually inserted.
Is there a nice and reliable workaround to get it?
Or should I stick with psycopg2 to write my large dictionary using a SQL query?
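One possible workaround (a sketch, not an accepted answer from the thread): for a one-row insert, skip DataFrame.to_sql() and issue the INSERT through SQLAlchemy Core, whose insert().returning() renders PostgreSQL's RETURNING clause directly. The table and column names below are assumed from the question:
import sqlalchemy as sa

engine = sa.create_engine('postgresql+psycopg2://', creator=my_connection_parameters)
metadata = sa.MetaData()
# Reflect the target table; 'public.data' is taken from the earlier psycopg2 example.
data_tbl = sa.Table('data', metadata, schema='public', autoload_with=engine)

with engine.begin() as conn:  # commits on success
    result = conn.execute(
        sa.insert(data_tbl)
        .values(**my_dict)  # the dict whose keys are the column identifiers
        .returning(data_tbl.c.id)
    )
    data_id = result.scalar_one()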

"Maximum number of parameters" error with filter .in_(list) using pyodbc

One of our queries that was working in Python 2 + mxODBC is not working in Python 3 + pyodbc; when connecting to SQL Server it raises an error like this: Maximum number of parameters in the sql query is 2100. Since the printed queries in both environments have 3000 parameters, I thought it should fail in both, but clearly that doesn't seem to be the case here. In the Python 2 environment both MSODBC 11 and MSODBC 17 work, so I immediately ruled out a driver-related issue.
So my questions are:
1. Is it correct to send a list as multiple parameters in SQLAlchemy, given that the parameter count will be proportional to the length of the list? It looks a bit strange to me; I would have preferred concatenating the list into a single string, because the DB doesn't understand the list datatype.
2. Are there any hints on why it works in mxODBC but not in pyodbc? Does mxODBC optimize something that pyodbc does not? Please let me know if there are any pointers; I can try and paste more info here. (I am still new to debugging SQLAlchemy.)
Footnote: I have seen a lot of answers that suggest chunking the data, but because of 1 and 2, I wonder if I am doing the correct thing in the first place.
(Since it seems to be related to pyodbc, I have raised an internal issue in the official repository.)
import sqlalchemy
import sqlalchemy.orm
from sqlalchemy import MetaData, Table
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm.session import Session
Base = declarative_base()
create_tables = """
CREATE TABLE products(
idn NUMERIC(8) PRIMARY KEY
);
"""
check_tables = """
SELECT * FROM products;
"""
insert_values = """
INSERT INTO products
(idn)
values
(1),
(2);
"""
delete_tables = """
DROP TABLE products;
"""
engine = sqlalchemy.create_engine('mssql+pyodbc://user:password@dsn')
connection = engine.connect()
cursor = engine.raw_connection().cursor()
Session = sqlalchemy.orm.sessionmaker(bind=connection)
session = Session()
session.execute(create_tables)
metadata = MetaData(connection)
class Products(Base):
    __table__ = Table('products', metadata, autoload=True)

try:
    session.execute(check_tables)
    session.execute(insert_values)
    session.commit()
    query = session.query(Products).filter(
        Products.idn.in_(list(range(0, 3000)))
    )
    query.all()
    f = open("query.sql", "w")
    f.write(str(query))
    f.close()
finally:
    session.execute(delete_tables)
    session.commit()
When you do a straightforward .in_(list_of_values), SQLAlchemy renders the following SQL ...
SELECT team.prov AS team_prov, team.city AS team_city
FROM team
WHERE team.prov IN (?, ?)
... where each value in the IN clause is specified as a separate parameter value. pyodbc sends this to SQL Server as ...
exec sp_prepexec @p1 output,N'@P1 nvarchar(4),@P2 nvarchar(4)',N'SELECT team.prov AS team_prov, team.city AS team_city, team.team_name AS team_team_name
FROM team
WHERE team.prov IN (@P1, @P2)',N'AB',N'ON'
... so you hit the limit of 2100 parameters if your list is very long. Presumably, mxODBC inserted the parameter values inline before sending it to SQL Server, e.g.,
SELECT team.prov AS team_prov, team.city AS team_city
FROM team
WHERE team.prov IN ('AB', 'ON')
You can get SQLAlchemy to do that for you with:
import sqlalchemy as sa

provinces = ["AB", "ON"]
stmt = (
    session.query(Team)
    .filter(
        Team.prov.in_(sa.bindparam("p1", expanding=True, literal_execute=True))
    )
    .statement
)
result = list(session.query(Team).params(p1=provinces).from_statement(stmt))
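If rendering the values literally is undesirable (they bypass driver-level parameter binding), the chunking mentioned in the question's footnote remains the usual alternative. A minimal sketch, assuming the same Team model and session as above:
def query_in_chunks(session, values, chunk_size=2000):
    # Keep each IN list under SQL Server's 2100-parameter limit.
    results = []
    for i in range(0, len(values), chunk_size):
        chunk = values[i:i + chunk_size]
        results.extend(session.query(Team).filter(Team.prov.in_(chunk)).all())
    return results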
