Until a few days ago I always stored thousands of parameters into my DB (SQL Server) without problems.
I use Spyder (Python 3.6).
I updated all packages with conda update --all a few days ago, and now I am no longer able to import my DataFrames into my DB.
--- I don't want a workaround that splits the DataFrame to stay under 2100 parameters ---
I would like to understand what has changed, why, and how to get back to a working setup.
This is a simple example:
import pyodbc
import sqlalchemy
import numpy as np
import pandas as pd
c = pyodbc.connect("Driver={SQL Server};Server=**;Trusted_Connection=no;Database=*;UID=*;PWD=*;")
cursor = c.cursor()
engine = sqlalchemy.create_engine('mssql+pyodbc://*:*/*?driver=SQL+Server')
df= pd.DataFrame(np.random.randn(5000))
df.to_sql('pr', engine, if_exists='append', index=False)
and this is the error:
ProgrammingError: (pyodbc.ProgrammingError) ('42000', '[42000] [Microsoft][ODBC SQL Server Driver][SQL Server]The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request. (8003) (SQLExecDirectW)')
Thanks a lot
Try to limit the chunksize:
df.to_sql('pr',
          engine,
          chunksize=20,
          if_exists='append',
          index=False)
This worked for me. I think the math for choosing the right chunksize is:
chunksize = 2100 / your number of columns
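A small sketch of the same idea, computing the value from the DataFrame itself (df and engine as in the question; leave a little headroom below 2100 to be safe):
chunksize = 2000 // len(df.columns)   # parameters per statement = chunksize * number of columns
df.to_sql('pr', engine, chunksize=chunksize, if_exists='append', index=False)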
There is an open (as of 2018-06-01) issue for this bug in pandas 0.23.
You might want to downgrade to 0.22, which works as expected.
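For example, since the packages were installed with conda, the downgrade could look like this:
conda install pandas=0.22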
Under the hood, Pandas uses SQLAlchemy to talk to the Database. SQLAlchemy prepares and executes what are known as parameterized queries. SQLAlchemy will prepare the query as a string with ? as a placeholder for your data. It will end up looking like this:
INSERT INTO my_table
VALUES (?,?,?,?,?, etc.)
To SQL Server the ? is treated as a parameter. SQL Server only allows 2100 of these parameters in a parameterized query.
The issue isn't so much the number of columns in your dataset; it's that SQL Server only allows you to include up to 2100 pieces of data in a parameterized query. Parameterized queries are used because they separate what is SQL and what is data, an important protection against accidentally executing malicious SQL.
As you've discovered, you can chunk the work such that you don't bump into the 2100 limit. Alternatively, you can construct the SQL yourself and execute it, but this is generally frowned upon as a security risk (SQL injection attack).
Related
I have a problem setup which involves data manipulation on an IBM i (AS/400) database. I'm trying to solve the problem with the help of Python and pandas.
For the last few days I have been trying to set up a proper connection to the AS/400 DB with every combination of package, driver, and dialect I could find on SO or Google.
None of the solutions is fully working for me. Some are better, while others are not working at all.
Here's the current situation:
I'm able to read and write data through pyodbc. The connection string I'm using is the following:
cstring = urllib.parse.quote("DRIVER={IBM i Access ODBC Driver};SYSTEM=IP;UID=XXX;PWD=YYY;PORT=21;CommitMode=0;SIGNON=4;CCSID=1208;TRANSLATE=1;")
Then I establish the connection like so:
connection = pypyodbc.connect(cstring)
With connection I can read and write data from/to the as400 db through raw SQL statements:
connection.execute("""CREATE TABLE WWNMOD5.temp(
    store_id INT GENERATED BY DEFAULT AS IDENTITY NOT NULL,
    store_name VARCHAR(150),
    PRIMARY KEY (store_id)
    )""")
This is, of course, a meaningless example. My goal would be to write a pandas DataFrame to the as400 by using
df.to_sql()
But when trying to do something like this:
df.to_sql('temp', connection, schema='WWNMOD5', chunksize=1000, if_exists='append', index=False)
I get this error:
pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': ('42000', '[42000] [IBM][System i Access ODBC-Treiber][DB2 für i5/OS]SQL0104 - Token ; ungültig. Gültige Token: <ENDE DER ANWEISUNG>.')
Meaning that an invalid token was used which in this case I believe is the ';' at the end of the SQL statement.
I believe that pandas isn't compatible with the pyodbc package here. Therefore, I also tried to work with the DB over SQLAlchemy.
With sqlalchemy, I establish the connection like so:
engine= sa.create_engine("iaccess+pyodbc:///?odbc_connect=%s"%cstring)
I also tried to use ibm_db_sa instead of iaccess but the result is always the same.
If I do the same from above with sqlalchemy, that is:
df.to_sql('temp', engine, schema='WWNMOD5', chunksize=1000, if_exists='append', index=False)
I don't get any error message but the table is not created either and I don't know why.
Is there a way to get this working? All the SO threads only suggest solutions for establishing a connection and reading data from AS/400 databases, but don't cover writing data back to the AS/400 DB via Python.
It looks like you are using the wrong driver. Pandas claims support for any DB supported by SQLAlchemy. But in order for SQLAlchemy to use DB2, it needs a third party extension.
This is the one recommended by SQLAlchemy: https://pypi.org/project/ibm-db-sa/
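As a rough sketch (not tested against your system): after installing ibm-db-sa and ibm_db, an engine built on that dialect can be passed straight to to_sql. The host, port, credentials, and database name below are placeholders.
import sqlalchemy as sa

# placeholders: replace host, port, credentials and database with your values
engine = sa.create_engine("ibm_db_sa://XXX:YYY@IP:446/WWNMOD5")

df.to_sql('temp', engine, schema='WWNMOD5', chunksize=1000, if_exists='append', index=False)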
I'm trying to insert some data into a DB2 table on an IBM iSeries (AS/400) server, using the sqlalchemy, pyodbc and iaccess packages.
The server allows me to run SELECT and CREATE queries, but when I try to insert rows I get the following error:
sqlalchemy.exc.DBAPIError: (pyodbc.Error) ('HY000', "[REDACTED]
SQL7008 - TABLE in DATABASE not valid for operations. (-7008)
(SQLExecDirectW)")
I'm executing the following query
INSERT INTO database.table VALUES ('A', 'B', 'C')
I know the query works because I am able to run it, using the same credentials, from Aqua Data Studio, a DB management IDE.
I'm using the following python code to connect to the db:
from sqlalchemy import create_engine
import pandas as pd
engine_statement = f"iaccess+pyodbc://{user}:{pwd}@{server}/{schema_name}?DRIVER={driver}"
connection = create_engine(engine_statement)
I tried using ibmi instead of iaccess+pyodbc but nothing changes.
The closest question I found asks the same thing but using Java.
I tried implementing the answer there in Python, by setting the isolation_level option to all possible values, but still nothing changes.
I'm not 100% sure how journaling works, and therefore how to use it, so I was not able to implement point 2 of that answer.
In case it helps: I am able to create new tables, but not write to them, which seems surprising; I'm no SQL expert though, so I guess I'm missing something.
I learned from a helpful post on Stack Overflow how to call stored procedures on SQL Server in Python (pyodbc). After modifying my code to what is below, I am able to connect and run execute() from the db_engine that I created.
import pyodbc
import sqlalchemy as sal
from sqlalchemy import create_engine
import pandas as pd
import urllib
params = urllib.parse.quote_plus(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    f'SERVER=myserver.com;'
    f'DATABASE=mydb;'
    f'UID=foo;'
    f'PWD=bar')
connection_string = f'mssql+pyodbc:///?odbc_connect={params}'
db_engine = create_engine(connection_string)
db_engine.execute("EXEC [dbo].[appDoThis] 'MYDB';")
<sqlalchemy.engine.result.ResultProxy at 0x1121f55e0>
db_engine.execute("EXEC [dbo].[appDoThat];")
<sqlalchemy.engine.result.ResultProxy at 0x1121f5610>
However, even though no errors are returned after running the above code in Python, when I check the database, I confirm that nothing has been executed (what is more telling is that the above commands take one or two seconds to complete whereas running these stored procedures successfully on the database admin tool takes about 5 minutes).
How should I understand what is not working correctly in the above setup in order to properly debug? I literally run the exact same code through my database admin tool with no issues - the stored procedures execute as expected. What could be preventing this from happening via Python? Does the executed SQL need to be committed? Is there a way to debug using the ResultProxy that is returned? Any advice here would be appreciated.
Calling .execute() directly on an Engine object is an outdated usage pattern and will emit deprecation warnings starting with SQLAlchemy version 1.4. These days the preferred approach is to use a context manager (with block) that uses engine.begin():
import sqlalchemy as sa
# …
with engine.begin() as conn:  # transaction starts here
    conn.execute(sa.text("EXEC [dbo].[appDoThis] 'MYDB';"))
    # On exiting the `with` block the transaction will automatically be committed
    # if no errors have occurred. If an error has occurred the transaction will
    # automatically be rolled back.
Notes:
When passing an SQL command string it should be wrapped in a SQLAlchemy text() object.
SQL Server stored procedures (and anonymous code blocks) should begin with SET NOCOUNT ON; in the overwhelming majority of cases. Failure to do so can result in legitimate results or errors getting "stuck behind" any row counts that may have been emitted by DML statements like INSERT, UPDATE, or DELETE.
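Putting both notes together with the first procedure call from the question, a minimal sketch could look like this:
import sqlalchemy as sa

with engine.begin() as conn:
    # text() wraps the raw SQL; SET NOCOUNT ON suppresses row-count messages
    # that could otherwise hide results or errors from the procedure
    conn.execute(sa.text("SET NOCOUNT ON; EXEC [dbo].[appDoThis] 'MYDB';"))
# the transaction is committed here, so the procedure's work is persisted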
I have a pandas DataFrame in python and want this DataFrame directly to be written into a Netezza Database.
I would like to use the pandas.to_sql() method that is described here, but it seems that this method requires SQLAlchemy to connect to the database.
The problem: SQLAlchemy does not support Netezza.
What I am using at the moment to connect to the database is pyodbc. But this, on the other hand, is not understood by pandas.to_sql(), or am I wrong about that?
My workaround is to write the DataFrame into a CSV file via pandas.to_csv() and send it to the Netezza database via pyodbc.
Since I have big data, writing the CSV first is a performance issue. I actually do not care whether I have to use SQLAlchemy, pyodbc, or something different, but I cannot change the fact that I have a Netezza database.
I am aware of the deontologician project, but as the author himself states, it "is far from complete, has a lot of bugs".
I got the package to work (see my solution below). But if someone knows a better solution, please let me know!
I figured it out. For my solution see the accepted answer.
Solution
I found a solution that I want to share with everyone who has the same problem.
I tried the Netezza dialect from deontologician, but it does not work with Python 3, so I made a fork and corrected some encoding issues. I uploaded it to GitHub and it is available here. Be aware that I only made some small changes; it is mostly the work of deontologician, and nobody is maintaining it.
Having the Netezza dialect, I got pandas.to_sql() to work directly with the Netezza database:
import netezza_dialect
from sqlalchemy import create_engine
engine = create_engine("netezza://ODBCDataSourceName")
df.to_sql("YourDatabase",
          engine,
          if_exists='append',
          index=False,
          dtype=your_dtypes,
          chunksize=1600,
          method='multi')
A little explanation of the to_sql() parameters:
It is essential that you use the method='multi' parameter if you do not want pandas to take forever writing to the database, because without it it would send one INSERT query per row. You can use 'multi' or you can define your own insertion method. Be aware that you need at least pandas v0.24.0 to use it. See the docs for more info.
When using method='multi' it can happen (it happened to me, at least) that you exceed the parameter limit. In my case it was 1600, so I had to add chunksize=1600 to avoid this.
Note
If you get a warning or error like the following:
C:\Users\USER\anaconda3\envs\myenv\lib\site-packages\sqlalchemy\connectors\pyodbc.py:79: SAWarning: No driver name specified; this is expected by PyODBC when using DSN-less connections
"No driver name specified; "
pyodbc.InterfaceError: ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')
Then you probably tried to connect to the database via
engine = create_engine("netezza://usr:pass@address:port/database_name")
You have to set up the database in the ODBC Data Source Administrator tool in Windows and then use the name you defined there:
engine = create_engine("netezza://ODBCDataSourceName")
Then it should have no problems finding the driver.
I know you already answered the question yourself (thanks for sharing the solution)
One general comment about large data-writes to Netezza:
I’d always choose to write data to a file and then use the external table/ODBC interface to insert the data. Instead of inserting 1600 rows at a time, you can probably insert millions of rows in the same timeframe.
We use UTF-8 data in the flat file, and CSV format, unless you want to load binary data, which will probably require fixed-width files.
I'm not Python savvy, but I hope you can follow me...
If you need a documentation link, you can start here: https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.load.doc/c_load_create_external_tbl_syntax.html
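I'm not sure of your exact setup, but translated into Python this approach might look roughly like the sketch below; the file path and table name are placeholders, and the external-table options should be checked against the documentation linked above.
import sqlalchemy as sa

# 1) dump the DataFrame to a flat UTF-8 CSV file (path is a placeholder)
df.to_csv("C:/temp/data.csv", index=False, header=False, encoding="utf-8")

# 2) load the whole file in one statement through a transient external table
with engine.begin() as conn:
    conn.execute(sa.text("""
        INSERT INTO YOUR_TABLE
        SELECT * FROM EXTERNAL 'C:/temp/data.csv'
        USING (DELIMITER ',' REMOTESOURCE 'ODBC')
    """))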
I'm calling an extremely simple query from a Python program using the pymssql library.
with self.conn.cursor() as cursor:
    cursor.execute('select extra_id from mytable where id = %d', id)
    extra_id = cursor.fetchone()[0]
Note that parameter binding is used as described in pymssql documentation.
One of the main goals of parameter binding is to let the DBMS engine cache the query plan. I connected to MS SQL with Profiler and checked which queries are actually executed. It turned out that each time a unique statement gets executed (with its own bound ID). I also checked plan usage with this query:
select * from sys.dm_exec_cached_plans ec
cross apply
sys.dm_exec_sql_text(ec.plan_handle) txt
where txt.text like '%select extra_id from mytable where id%'
And it showed that the plan is not reused (which is to be expected, of course, due to the unique text of each query). This differs a lot from parameter binding when querying from C#, where we can see that the queries are the same but the supplied parameters differ.
So I wonder if I am using pymssql correctly and whether this lib is appropriate for use with MS SQL Server.
P.S. I know that MS SQL has an auto-parameterization feature which works for basic queries, but it is not guaranteed and may not work for complex queries.
You are using pymssql correctly. It is true that pymssql actually does substitute the parameter values into the SQL text before sending the query to the server. For example:
pymssql:
SELECT * FROM tablename WHERE id=1
pyodbc with Microsoft's ODBC Driver for SQL Server (not the FreeTDS ODBC driver):
exec sp_prepexec @p1 output,N'@P1 int',N'SELECT * FROM tablename WHERE id=@P1',1
However, bear in mind that pymssql is based on FreeTDS and the above behaviour appears to be a function of the way FreeTDS handles parameterized queries, rather than a specific feature of pymssql per se.
And yes, it can have implications for the re-use of execution plans (and hence performance) as illustrated in this answer.
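If plan reuse matters for your workload, one alternative (a sketch only; the connection string values are placeholders) is to run the same query through pyodbc with Microsoft's ODBC driver, which as shown above ships the statement via sp_prepexec with a real bound parameter:
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myserver;Database=mydb;UID=user;PWD=password;"
)
cursor = conn.cursor()
# pyodbc uses ? placeholders; the value travels as a bound parameter,
# so the statement text stays constant and the cached plan can be reused
cursor.execute("SELECT extra_id FROM mytable WHERE id = ?", id)
extra_id = cursor.fetchone()[0]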