I am new to both Python and data science, so please forgive anything that sounds foolish here. I'll keep the details out unless anyone's curious, but basically I need to connect to Microsoft SQL Server and upload a pandas DataFrame that is relatively large (~500k rows), and as the project currently stands I need to do this almost every day.
It doesn't have to be a pandas DataFrame - I've read about using odo for CSV files, but I haven't been able to get anything to work. The issue I'm having is that I can't BULK INSERT the data because the file isn't on the same machine as the SQL Server instance. I'm consistently getting errors like the following:
pyodbc.ProgrammingError: ('42000', "[42000] [Microsoft][ODBC SQL
Server Driver][SQL Server]Incorrect syntax near the keyword 'IF'.
(156) (SQLExecDirectW)")
As I've attempted different SQL statements, you can replace 'IF' with whatever keyword or column name happens to come first in the statement (for example, the first column name in the CREATE statement). I'm using SQLAlchemy to create the engine and connect to the database. This may go without saying, but the pd.to_sql() method is just way too slow for how much data I'm moving, which is why I need something faster.
I'm using Python 3.6, by the way. Below are most of the things I've tried that haven't been successful.
import pandas as pd
from sqlalchemy import create_engine
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 1)), columns=['test_col'])
address = 'mssql+pyodbc://uid:pw@server/path/database?driver=SQL Server'
engine = create_engine(address)
connection = engine.raw_connection()
cursor = connection.cursor()
# Attempt 1 <- This failed to even create a table at the cursor.execute statement, so my
# issues could start way back here, but I know that I have a connection to the SQL Server
# because I can use pd.to_sql() to create tables successfully (just incredibly slowly for
# my tables of interest)
create_statement = """
DROP TABLE test_table
CREATE TABLE test_table (test_col)
"""
cursor.execute(create_statement)
test_insert = '''
INSERT INTO test_table
(test_col)
values ('abs');
'''
cursor.execute(test_insert)
# Attempt 2 <- From an iabdb WordPress blog post I came across
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

records = [str(tuple(x)) for x in take_rates.values]
insert_ = """
INSERT INTO test_table
("A")
VALUES
"""
for batch in chunker(records, 2):  # This would be set to 1000 in practice I hope
    print(batch)
    rows = str(batch).strip('[]')
    print(rows)
    insert_rows = insert_ + rows
    print(insert_rows)
    cursor.execute(insert_rows)
    # connection.commit()  # don't know when I would need to commit
connection.close()
# Attempt 3 <- From a related Stack Exchange post
# create the table, but first drop it if it already exists
command = """DROP TABLE IF EXISTS test_table
CREATE TABLE test_table # these columns are from my real dataset
"Serial Number" serial primary key,
"Dealer Code" text,
"FSHIP_DT" timestamp without time zone,
;"""
cursor.execute(command)
connection.commit()
# stream the data using 'to_csv' and StringIO(); then use sql's 'copy_from' function
import io
output = io.StringIO()
# ignore the index
take_rates.to_csv(output, sep='~', header=False, index=False)
# jump to start of stream
output.seek(0)
contents = output.getvalue()
cur = connection.cursor()
# null values become ''
cur.copy_from(output, 'Config_Take_Rates_TEST', null="")
connection.commit()
cur.close()
It seems to me that MS SQL Server is just not a nice Database to play around with...
I want to apologize for the rough formatting - I've been at this script for weeks now but just finally decided to try to organize something for StackOverflow. Thank you very much for any help anyone can offer!
If you only need to replace the existing table, truncate it and use bcp utility to upload the table. It's much faster.
from subprocess import call
command = "TRUNCATE TABLE test_table"
cursor.execute(command)  # run the truncate through the existing pyodbc cursor
connection.commit()
take_rates.to_csv('take_rates.csv', sep='\t', index=False)
call('bcp {t} in {f} -S {s} -U {u} -P {p} -d {db} -c -t "{sep}" -r "{nl}"'.format(
    t='test_table', f='take_rates.csv', s=server, u=user, p=password,
    db=database, sep='\t', nl='\n'), shell=True)
You will need to install bcp utility (yum install mssql-tools on CentOS/RedHat).
'DROP TABLE IF EXISTS test_table' is only valid T-SQL syntax on SQL Server 2016 and later; on earlier versions it is invalid.
On older versions you can do something like this:
if (object_id('test_table') is not null)
    DROP TABLE test_table
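Wired into the question's existing pyodbc cursor, that pattern might look roughly like the sketch below; the single VARCHAR column is just an illustrative placeholder, and the drop and create are sent as two separate execute calls:
# Sketch only: run the conditional drop and the create as separate batches
# through the cursor from the question; test_col's type is a placeholder.
drop_if_exists = """
IF OBJECT_ID('test_table') IS NOT NULL
    DROP TABLE test_table
"""
create_table = """
CREATE TABLE test_table (
    test_col VARCHAR(50)
)
"""
cursor.execute(drop_if_exists)
cursor.execute(create_table)
connection.commit()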
I am VERY new to Azure and Azure functions, so be gentle. :-)
I am trying to write an Azure timer function (using Python) that will take the results returned from an API call and insert the results into a table in Azure SQL.
I am virtually clueless. If someone would be willing to handhold me through the process, it would be MOST appreciated.
I have the API call already written, so that part is done. What I totally don't get is how to get the results from what is returned into Azure SQL.
The result set I am returning is in the form of a Pandas dataframe.
Again, any and all assistance would be AMAZING!
Thanks!!!!
Here is an example that writes a pandas DataFrame to a SQL table:
import pyodbc
import pandas as pd
# insert data from csv file into dataframe.
# working directory for csv file: type "pwd" in Azure Data Studio or Linux
# working directory in Windows c:\users\username
df = pd.read_csv("c:\\users\\username\\department.csv")
# Some other example server values are
# server = 'localhost\sqlexpress' # for a named instance
# server = 'myserver,port' # to specify an alternate port
server = 'yourservername'
database = 'AdventureWorks'
username = 'username'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
# Insert Dataframe into SQL Server:
for index, row in df.iterrows():
    cursor.execute("INSERT INTO HumanResources.DepartmentTest (DepartmentID,Name,GroupName) values(?,?,?)", row.DepartmentID, row.Name, row.GroupName)
cnxn.commit()
cursor.close()
To make it work for your case you need to:
replace the read from the csv file with your function call
change the INSERT statement to match the structure of your SQL table (a rough sketch of that adaptation follows the link below)
For more details see: https://learn.microsoft.com/en-us/sql/machine-learning/data-exploration/python-dataframe-sql-server?view=sql-server-ver15
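To make that concrete, here is a rough, untested sketch of a timer-triggered function using the v1 Python programming model (an __init__.py alongside a function.json that defines the timer binding). get_api_data(), the connection values, and the table/column names are all placeholders for your own code, and fast_executemany is an optional pyodbc setting that batches the insert instead of one round trip per row:
# __init__.py of the timer-triggered function (sketch only)
import azure.functions as func
import pyodbc

def main(mytimer: func.TimerRequest) -> None:
    df = get_api_data()  # placeholder: your existing API call that returns a pandas DataFrame

    cnxn = pyodbc.connect(
        'DRIVER={ODBC Driver 17 for SQL Server};'
        'SERVER=yourserver.database.windows.net;'
        'DATABASE=yourdatabase;UID=username;PWD=yourpassword'
    )
    cursor = cnxn.cursor()
    cursor.fast_executemany = True  # send the rows in batches rather than one by one
    cursor.executemany(
        "INSERT INTO dbo.MyTable (ColA, ColB, ColC) VALUES (?, ?, ?)",
        list(df.itertuples(index=False, name=None)),  # plain tuples in column order
    )
    cnxn.commit()
    cursor.close()
    cnxn.close()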
I am extracting millions of rows from SQL Server and inserting them into an Oracle DB using Python. It is inserting roughly one record per second into the Oracle table, so it takes hours. What is the fastest approach to load the data?
My code below:
def insert_data(conn, cursor, query, data, batch_size=10000):
    recs = []
    count = 1
    for rec in data:
        recs.append(rec)
        if count % batch_size == 0:
            cursor.executemany(query, recs, batcherrors=True)
            conn.commit()
            recs = []
        count = count + 1
    cursor.executemany(query, recs, batcherrors=True)
    conn.commit()
Perhaps you cannot buy a 3rd-party ETL tool, but you can certainly write a procedure in PL/SQL in the Oracle database.
First, install the Oracle Transparent Gateway for ODBC. There is no license cost involved.
Second, in the Oracle DB, create a db link that references the MSSQL database via the gateway (see the sketch after these steps).
Third, write a PL/SQL procedure to pull the data from the MSSQL database via the db link.
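For step two, the db link itself is a single DDL statement. A rough sketch of issuing it from Python with cx_Oracle follows; the link name, credentials, and the MSSQL_GW tnsnames.ora alias (configured as part of the gateway setup) are all placeholders, and the Oracle account needs the CREATE DATABASE LINK privilege:
import cx_Oracle

# connect to the Oracle database that will own the link (placeholder credentials/DSN)
ora = cx_Oracle.connect('ora_user', 'ora_password', 'orahost/orclpdb1')
cur = ora.cursor()

# MSSQL_GW is a placeholder tnsnames.ora alias pointing at the ODBC gateway
cur.execute("""
    CREATE DATABASE LINK mssql_link
    CONNECT TO "sql_user" IDENTIFIED BY "sql_password"
    USING 'MSSQL_GW'
""")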
I was once presented with a problem similar to yours: a developer was using SSIS to copy around a million rows from MSSQL to Oracle, and it was taking over 4 hours. I ran a trace on his process and saw that it was copying row-by-row, slow-by-slow. It took me less than 30 minutes to write a PL/SQL proc to copy the data, and it completed in less than 4 minutes.
I give a high-level view of the entire setup and process, here:
EDIT:
Thought you might like to see exactly how simple the actual procedure is:
create or replace procedure my_load_proc
as
begin
  insert into my_oracle_table (col_a,
                               col_b,
                               col_c)
  select sql_col_a,
         sql_col_b,
         sql_col_c
    from mssql_tbl@mssql_link;
end;
My actual procedure has more to it, dealing with run-time logging, emailing notification of completion, etc. But the above is the 'guts' of it, pulling the data from mssql into oracle.
Then you might want to use pandas, PySpark, or another big data framework available in Python.
There are a lot of examples out there; here is how to load data, adapted from the Microsoft docs:
import pyodbc
import pandas as pd
import cx_Oracle
from sqlalchemy import create_engine
server = 'servername'
database = 'AdventureWorks'
username = 'yourusername'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
query = "SELECT [CountryRegionCode], [Name] FROM Person.CountryRegion;"
df = pd.read_sql(query, cnxn)
# you do data manipulation that is needed here
# then insert data into oracle
conn = create_engine('oracle+cx_oracle://xxxxxx')
df.to_sql(table_name, conn, index=False, if_exists="replace")
Something like that (it might not work 100%, but it should give you an idea of how you can do it).
I'm trying to insert data with Python from SQL Server to Snowflake table. It works in general, but if I want to insert a bigger chunk of data, it gives me an error:
snowflake connector SQL compilation error: maximum number of expressions in a list exceeded, expected at most 16,384
I'm using the Snowflake connector for Python. So it works as long as I insert at most 16,384 rows at once, but my table has over a million records. I don't want to use CSV files.
I was able to insert > 16k recs using sqlalchemy and pandas as:
pandas_df.to_sql(sf_table, con=engine, index=False, if_exists='append', chunksize=16000)
where engine is sqlalchemy.create_engine(...)
This is not the ideal way to load data into Snowflake, but since you specified that you didn't want to create CSV files, you could look into loading the data into a pandas DataFrame and then using the write_pandas function in the Python connector, which (behind the scenes) leverages flat files and a COPY INTO statement - the fastest way to get data into Snowflake. The issue with this method will likely be that pandas requires a lot of memory on the machine you are running the script on. There is a chunk_size parameter, though, so you can control it with that.
https://docs.snowflake.com/en/user-guide/python-connector-api.html#write_pandas
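A rough sketch of that route, assuming the connector was installed with the pandas extra (pip install "snowflake-connector-python[pandas]"), that pandas_df is the DataFrame mentioned above, and that the target table MY_TABLE already exists with matching columns - those names and credentials are placeholders:
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

conn = snowflake.connector.connect(
    account='accountname',
    user='username',
    password='password',
    warehouse='warehouse_name',
    database='database_name',
    schema='public',
)

# write_pandas stages the frame as compressed files and runs COPY INTO behind the scenes;
# chunk_size caps how many rows go into each staged file to limit memory use
success, nchunks, nrows, _ = write_pandas(conn, pandas_df, 'MY_TABLE', chunk_size=100000)
print(success, nchunks, nrows)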
For whoever is facing this problem, below is a complete solution to connect and insert data into Snowflake using SQLAlchemy.
from sqlalchemy import create_engine
import pandas as pd
snowflake_username = 'username'
snowflake_password = 'password'
snowflake_account = 'accountname'
snowflake_warehouse = 'warehouse_name'
snowflake_database = 'database_name'
snowflake_schema = 'public'
engine = create_engine(
    'snowflake://{user}:{password}@{account}/{db}/{schema}?warehouse={warehouse}'.format(
        user=snowflake_username,
        password=snowflake_password,
        account=snowflake_account,
        db=snowflake_database,
        schema=snowflake_schema,
        warehouse=snowflake_warehouse,
    ), echo_pool=True, pool_size=10, max_overflow=20
)

try:
    connection = engine.connect()
    results = connection.execute('select current_version()').fetchone()
    print(results[0])

    df.columns = map(str.upper, df.columns)
    df.to_sql('table'.lower(), con=connection, schema='amostraschema', index=False, if_exists='append', chunksize=16000)
finally:
    connection.close()
    engine.dispose()
Use executemany with server-side binding (the qmark paramstyle). This will create files in a staging area behind the scenes and will allow you to insert more than 16,384 rows.
import snowflake.connector

con = snowflake.connector.connect(
    account='',
    user='',
    password='',
    database='',
    paramstyle='qmark')

sql = "insert into tablename (col1, col2) values (?, ?)"
rows = [[1, 2], [3, 4]]
con.cursor().executemany(sql, rows)
See https://docs.snowflake.com/en/user-guide/python-connector-example.html#label-python-connector-binding-batch-inserts for more details
Note: This will not work with client side binding, the %s format.
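If your rows are coming out of a pandas DataFrame, as in the question, one way to feed them to that executemany call is sketched below (the column names are placeholders):
# hypothetical adaptation: pull plain tuples out of the DataFrame in column order
rows = list(df[['col1', 'col2']].itertuples(index=False, name=None))
sql = "insert into tablename (col1, col2) values (?, ?)"
con.cursor().executemany(sql, rows)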
I am using Impyla and Python in CDSW to query data in HDFS and use it. The problem is that sometimes, to get all of the data, I have to go into HUE and manually click the "Invalidate all metadata and rebuild index" button.
Is there a way to do this in the workbench with a library or Python code?
I assume you are using something like the following to connect to Impala via Impyla; try executing the INVALIDATE METADATA <table_name> command first:
from impala.dbapi import connect
conn = connect(host='my.host.com', port=21050)
cursor = conn.cursor()
cursor.execute('INVALIDATE METADATA mytable') # run this
cursor.execute('SELECT * FROM mytable LIMIT 100')
print(cursor.description)  # prints the result set's schema
results = cursor.fetchall()
I am trying to create a program that takes a number of tab-delimited text files and works through them one at a time, entering the data they hold into a MySQL database. There are several text files, like movies.txt, which looks like this:
1 Avatar
3 Iron Man
3 Star Trek
and actors.txt, which looks the same, etc. Each text file has upwards of one hundred entries, each with an id and a corresponding value as seen above. I have found a number of code examples on this site and others, but I can't quite get my head around how to implement them in this situation.
So far my code looks something like this ...
import MySQLdb
database_connection = MySQLdb.connect(host='localhost', user='root', passwd='')
cursor = database_connection.cursor()
cursor.execute('CREATE DATABASE library')
cursor.execute('USE library')
cursor.execute('''CREATE TABLE popularity (
    PersonNumber INT,
    Category VARCHAR(25),
    Value VARCHAR(60)
)
''')
def data_entry(categories):
Every time I try to get the other code I have found working with this, I just get lost completely. Hoping someone can help me out by either showing me what I need to do or pointing me in the direction of some more information.
Examples of the code I have been trying to adapt to my situation are:
import MySQLdb, csv, sys

conn = MySQLdb.connect(host="localhost", user="usr", passwd="pass", db="databasename")
c = conn.cursor()
csv_data = csv.reader(file("a.txt"))
for row in csv_data:
    print row
    c.execute("INSERT INTO a (first, last) VALUES (%s, %s)", row)
conn.commit()
c.close()
and:
Python File Read + Write
MySQL can read TSV files directly using the mysqlimport utility or by executing the LOAD DATA INFILE SQL command. This will be faster than processing the file in Python and inserting rows one at a time, but you may want to learn how to do both. Good luck!
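A rough sketch of the LOAD DATA LOCAL INFILE route from Python, reusing the connection style from the question; it assumes the files really are tab-separated as described, that local_infile is enabled on both the client and the server, and that the popularity table from the question already exists:
import MySQLdb

# local_infile=1 lets the client send the local file to the server
conn = MySQLdb.connect(host='localhost', user='root', passwd='', db='library', local_infile=1)
cur = conn.cursor()
cur.execute(r"""
    LOAD DATA LOCAL INFILE 'movies.txt'
    INTO TABLE popularity
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    (PersonNumber, Value)
    SET Category = 'movies'
""")
conn.commit()
cur.close()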