I have a Python script that takes data from a CSV and inserts it into my MS SQL Server. The CSV is about 35 MB and contains about 200,000 records with 15 columns. The SQL Server Import Tool takes less than 5 minutes to insert all the data into the server. The Python script, using pypyodbc, takes 30 minutes or longer.
What am I doing wrong in my code that is causing this to take so long?
import pypyodbc
import csv
import datetime
now = datetime.datetime.now()
cnxn = pypyodbc.connect('DRIVER={SQL Server};SERVER=;DATABASE=')
cursor = cnxn.cursor()
cursor.execute("""
DELETE FROM DataMaster
""")
cnxn.commit()
FileName = "Data - " + str('{:02d}'.format(now.month)) + "-" + str('{:02d}'.format(now.day-1)) + "-" + str(now.year) + ".csv"
myCSV = open(FileName)
myCSV = csv.reader(myCSV)
next(myCSV, None) # this skips the headers
listlist = list(myCSV)
cursor.executemany('''
INSERT INTO DataMaster (Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,Column12,Column13,Column14,Column15)
VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)''',
    listlist)
cursor.commit() # commits any changes
cnxn.close() # closes the connection
print "Import Completed."
The code below should work in your case. It should not take much time.
import glob
import logging
import sys

import pymysql

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    # local_infile must be enabled for LOAD DATA LOCAL INFILE to work
    conn = pymysql.connect(host='', user='user', passwd='password',
                           db='dbname', connect_timeout=5, local_infile=True)
except Exception as e:
    logger.error("ERROR: Unexpected error: Could not connect to MySQL instance.")
    logger.error(e)
    sys.exit()

logger.info("SUCCESS: Connection to MySQL instance succeeded")
cursor = conn.cursor()

path = 'path/to/csvdir'  # placeholder: directory that holds the CSV files
for csv_file in glob.glob(path + "/*.csv"):
    add_csv_file = ("LOAD DATA LOCAL INFILE '%s' INTO TABLE tablename "
                    "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
                    "IGNORE 1 LINES" % csv_file)
    cursor.execute(add_csv_file)
    conn.commit()

cursor.close()
conn.close()
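One caveat: LOAD DATA LOCAL INFILE is MySQL syntax, while the question targets MS SQL Server. A rough sketch of the closest T-SQL equivalent, BULK INSERT (assumptions: a hypothetical file path, which must be readable by the SQL Server machine itself, and the table name from the question; the login needs bulk-load rights):
import pypyodbc

# Hypothetical DSN values; table name taken from the question.
cnxn = pypyodbc.connect('DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb')
cursor = cnxn.cursor()
cursor.execute("""
    BULK INSERT DataMaster
    FROM 'C:\\data\\Data.csv'  -- path as seen by the SQL Server machine
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', FIRSTROW = 2)
""")
cnxn.commit()
cnxn.close()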
Please accept the answer if this works.
The question is at the bottom of the post.
I'm on Windows 10. Using MySQL Workbench 8.0CE.
Data is 1014 rows of movie manuscripts.
20x faster is meant literally. It went from 40 minutes on InnoDB to 2 minutes on MyISAM, running the following Python script.
from random import randint
from time import sleep
import requests
from bs4 import BeautifulSoup as bs
import json
import pymysql
import traceback
import logging
from tqdm import tqdm

logging.basicConfig(format='%(asctime)s - %(message)s', level=logging.INFO)

mysql_code = "password"

def getData(id, script_raw):
    #print(script_raw)
    script_clean = remove_html_tags(script_raw).replace("'", "''")
    #output_str = ''.join(c for c in script_clean if c.isprintable())
    #print(output_str)
    #print(script_clean_2)
    save_data(script_clean, id)

def remove_html_tags(text):
    """Remove html tags from a string"""
    import re
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

def save_data(script_clean, id):
    try:
        conn = pymysql.connect(host='localhost', user='admin',
                               passwd=mysql_code, db='manuscriptproject')
        cur = conn.cursor()
        query = "UPDATE `clean_movie_script` SET `script_clean` = '%s' WHERE (`id` = '%s');"
        final_query = query % (script_clean, id)
        cur.execute(final_query)
        conn.commit()
        cur.close()
        conn.close()
    except Exception as e:
        logging.info("Error with query for id : " + str(id))
        logging.error(traceback.format_exc())
        logging.error(e)

def get_non_populated_records():
    conn = pymysql.connect(host='localhost', user='admin',
                           passwd=mysql_code, db='manuscriptproject')
    cur = conn.cursor()
    cur.execute(
        "SELECT id, script FROM `movie_script` "
        "WHERE script IS NOT NULL "
        "ORDER BY id asc "
        "LIMIT 100000")
    data = list(cur.fetchall())
    conn.close()
    return data

if __name__ == "__main__":
    unpopulated_records = get_non_populated_records()
    for x in tqdm(unpopulated_records):
        try:
            getData(x[0], x[1])
        except Exception as e:
            print(e)
Switching from InnoDB to MyISAM changed the way my 'id' column is defined: from an INT with PK, NN, Unique, and Auto-Increment enabled to an INT with only NN enabled and a default expression of '0'. I also cannot change the engine back to InnoDB now.
Question: I'm trying to understand which engine is best for my use case. 2 minutes still seems slow for a 200 MB database. From what I found online, InnoDB should be faster than MyISAM, and it might have something to do with how I define my 'id' column; I just cannot figure it out.
Do this only once for the entire program:
conn = pymysql.connect(...)
(It may be the most costly part of the task, and the code seems to do it inside a loop.)
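A minimal sketch of that change against the script above (it reuses mysql_code, remove_html_tags, and get_non_populated_records from the question): one connection and one cursor for the whole run, with a single commit at the end.
import pymysql
from tqdm import tqdm

# One connection for the entire run, instead of a connect/close per record
# inside save_data().
conn = pymysql.connect(host='localhost', user='admin',
                       passwd=mysql_code, db='manuscriptproject')
cur = conn.cursor()

for id, script in tqdm(get_non_populated_records()):
    script_clean = remove_html_tags(script)
    # Parameterized query: pymysql quotes the value itself, so the manual
    # .replace("'", "''") escaping is no longer needed.
    cur.execute("UPDATE `clean_movie_script` SET `script_clean` = %s "
                "WHERE `id` = %s", (script_clean, id))

conn.commit()  # per-row commits also add overhead; one commit is enough here
cur.close()
conn.close()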
I have an Excel file. I'm importing it into a dataframe and trying to update a database table with the data.
import pyodbc

def get_sale_file():
    try:
        cnxn = pyodbc.connect('DRIVER=ODBC Driver 17 for SQL Server;'
                              'SERVER=' + server + ';DATABASE=' + database +
                              ';UID=' + uname + ';PWD=' + pword,
                              autocommit=False)

        files = os.listdir(sap_file_path)
        df = pd.DataFrame()
        for f in files:
            if f.endswith('.xlsx') or f.endswith('.xls'):
                df = pd.read_excel(os.path.join(sap_file_path, f))
                df.to_sql('temptable', cnxn, if_exists='replace')
                query = ("UPDATE MList AS mas"
                         " SET TTY = temp.[Territory Code],"
                         " Freq = temp.[Frequency Code]"
                         " FROM temptable AS temp"
                         " WHERE mas.SiteCode = temp.[ri a]")
When I execute the above code block, I get:
1/12/2019 10:19:45 AM ERROR: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': ('42S02', "[42S02] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid object name 'sqlite_master'. (208) (SQLExecDirectW)")
Am I trying this the right way? Does pandas have any other function to update a MSSQL table besides to_sql?
How can I overcome the above error?
Edit
Should I create temptable beforehand to load the dataframe? If so, my file contains hundreds of columns and they may vary (except for a few columns). How can I make sure pandas loads only a few columns into temptable?
According to the documentation for pandas.DataFrame.to_sql (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html), the connection is expected to be of type sqlalchemy.engine.Engine or sqlite3.Connection, so it is necessary to change your code to use a connection like this (that also explains the error: given a raw pyodbc connection, to_sql falls back to its sqlite path and queries sqlite_master):
import sqlalchemy

cnxn = sqlalchemy.create_engine("mssql+pyodbc://<username>:<password>@<dsnname>")
df.to_sql("table_name", cnxn, if_exists='replace')
UPDATE: Using urllib:
import urllib.parse  # on Python 2 this was urllib.quote_plus
import sqlalchemy

params = urllib.parse.quote_plus("DRIVER={ODBC Driver 17 for SQL Server};SERVER=yourserver;DATABASE=yourdatabase;UID=user;PWD=password")
cnxn = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
df.to_sql("table_name", cnxn, if_exists='replace')
You can also try another package instead of pyodbc, e.g. pytds or adodbapi.
The first one is very simple; with adodbapi the connection config looks like this:
from adodbapi import adodbapi as adba
raw_config_adodbapi = f"PROVIDER=SQLOLEDB.1;Data Source={server};Initial Catalog={database};trusted_connection=no;User ID={user};Password={password};"
conn = adba.connect(raw_config_adodbapi, timeout=120, autocommit=True)
Besides, it seems like the parameters in the connection string in pyodbc should be enclosed in {}, but maybe it's not mandatory.
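For completeness, a minimal sketch of the pytds variant (hedged: the keyword names below follow my reading of the python-tds docs and are an assumption, as are all the credential values):
import pytds

# Hypothetical connection values; dsn is the server host (plus instance, if any).
with pytds.connect(dsn='myserver', database='mydb',
                   user='myuser', password='mypassword') as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchall())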
I would like to load a SQL query into a data frame as efficiently as possible. I read different sources and everyone seems to use a different approach, and I am not sure why. Some are using cursors, some aren't.
Currently I have:
import pandas as pd
import pyodbc
con = pyodbc.connect('Driver={something};'
                     'Server=something;'
                     'Database=something;'
                     'Trusted_Connection=yes;')
sql="""
SQL CODE
"""
df = pd.read_sql_query(con,sql)
And for some reason, this doesn't work on my machine.
Just pack it in a function. I also add a username and password (just in case):
import pandas as pd
import pyodbc

def GetSQLData(dbName, query):
    sPass = 'MyPassword'
    sServer = 'MyServer\\SQL1'
    uname = 'MyUser'
    cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                          "Server=" + sServer + ";"
                          "Database=" + dbName + ";"
                          "uid=" + uname + ";pwd=" + sPass)
    df = pd.read_sql(query, cnxn)
    return df
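Usage would then look like this (the database name and query are hypothetical):
df = GetSQLData('MyDatabase', 'SELECT TOP 100 * FROM MyTable')
print(df.head())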
Try this solution
import pyodbc
import pandas as pd
cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
"Server=something;"
"Database=something;"
"Trusted_Connection=yes;")
cursor = cnxn.cursor()
cursor.execute('SELECT * FROM something')
for row in cursor:
print(row)
import pandas as pd
import pyodbc

con = pyodbc.connect('Driver={something};'
                     'Server=something;'
                     'Database=something;'
                     'Trusted_Connection=yes;')

cursor = con.cursor()
cursor.execute("SQL Syntax")
Not quite sure what your last line is doing, but the cursor method works well and runs efficiently with minimal code.
This should run. You can test it by adding sqllist = list(cursor.fetchall()) and then print(sqllist).
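If the end goal is still a DataFrame, a small sketch of building one from the cursor results (this assumes the SELECT above has already been executed; the column names come from cursor.description):
import pandas as pd

rows = cursor.fetchall()
columns = [col[0] for col in cursor.description]  # first field of each tuple is the column name
df = pd.DataFrame([tuple(row) for row in rows], columns=columns)
print(df.head())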
I am using the following command to load multiple .csv files into a MySQL database, but I am getting no errors (in the IDLE window) and the data does not load.
Here is the erroneous script
#!C:\Python27\python.exe
import MySQLdb
import os
import string

# Open database connection
db = MySQLdb.connect(host="localhost", port=3307, user="root",
                     passwd="gamma123", db="test")
cursor = db.cursor()

l = os.listdir(".")
for file_name in l:
    print file_name
    cursor = db.cursor()
    if (file_name.find("DIV.csv") > -1):
        # Query under testing
        sql = """LOAD DATA LOCAL INFILE file_name \
                 INTO TABLE system_work \
                 FIELDS TERMINATED BY ',' \
                 OPTIONALLY ENCLOSED BY '"' \
                 LINES TERMINATED BY '\r\n' \
                 IGNORE 1 LINES;;"""

try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
except:
    # Rollback in case there is any error
    db.rollback()

# disconnect from server
db.close()
But when I try to load a single file using the following Python script, it works fine.
Please help.
#!C:\Python27\python.exe
import MySQLdb
import os
import string

# Open database connection
db = MySQLdb.connect(host="localhost", port=3307, user="root",
                     passwd="gamma123", db="test")
cursor = db.cursor()

# Query under testing
sql = """LOAD DATA LOCAL INFILE 'Axle.csv' \
         INTO TABLE system_work \
         FIELDS TERMINATED BY ',' \
         OPTIONALLY ENCLOSED BY '"' \
         LINES TERMINATED BY '\r\n' \
         IGNORE 1 LINES;;"""

try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
except:
    # Rollback in case there is any error
    db.rollback()

# disconnect from server
db.close()
You need to interpolate the filename into the SQL string; you are just sending the literal text file_name to the server. You could use the str.format() method for that; any {} placeholder can then be replaced by a variable of your choosing.
You also must indent the try and except blocks to be within the for loop:
sql = """LOAD DATA LOCAL INFILE '{}'
INTO TABLE system_work
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\\r\\n'
IGNORE 1 LINES;;"""
for file_name in l:
print file_name
if file_name.endswith('DIV.csv'):
try:
cursor = db.cursor()
cursor.execute(sql.format(file_name))
db.commit()
except Exception:
# Rollback in case there is any error
db.rollback()
The cursor.execute() method is passed the sql string with the file_name variable interpolated. The {} part on the first line (LOAD DATA LOCAL INFILE '{}') will be replaced by the value in file_name before passing the SQL statement to MySQL.
I also simplified the filename test; presumably it is enough if the filename ends with DIV.csv.
Note that it might just be easier to use the mysqlimport utility; you can achieve the exact same results with:
mysqlimport --fields-terminated-by=, --fields-optionally-enclosed-by=\" \
--local --lines-terminated-by=\r\n --user=root --password=gamma123 \
test *DIV.csv
if (file_name.find("DIV.csv")>-1): unless all of your files are actually called DIV.csv should that be if (file_name.find(".csv")>-1): (that would probably be more efficient testing the last four letters of the file name by the way)
I wanted a script that iterates through the CSV files in a folder and dumps them into a MySQL database. I was able to dump one CSV file into it, but I am having trouble passing the file name into the SQL script.
This is the code I use
file_path="C:\csv-files"
files=os.listdir(file_path)
files.sort()
for n in files:
cursor.execute(" LOAD DATA LOCAL INFILE '%s' INTO TABLE new_table FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '"' Lines terminated by '\n' IGNORE 1 LINES ",(n))
And I get the following error
raise errorclass, errorvalue
ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'file1.csv'' INTO TABLE new_table FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY' at line 1")
If I use the file name directly instead of passing it, it works fine.
As you can see from the error thrown, there seems to be an error in the SQL script.
This is the whole code:
import csv
import MySQLdb
import sys
import os

connection = MySQLdb.connect(host='localhost',
                             user='root',
                             passwd='password',
                             db='some_db')
cursor = connection.cursor()

file_path = "C:\csv-files"
files = os.listdir(file_path)
files.sort()
for n in files:
    print n
    cursor.execute("LOAD DATA LOCAL INFILE %s INTO TABLE new_table FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '\"' LINES TERMINATED BY '\n' IGNORE 1 LINES" % n)

connection.commit()
cursor.close()
First, replace '%s' with %s in the query. MySQLdb handles any quoting automatically.
Here's the code with some corrections and changes:
import MySQLdb
import os

CSV_DIR = "C:\csv-files"

connection = MySQLdb.connect(host='localhost',
                             user='root',
                             passwd='password',
                             db='some_db',
                             local_infile=1)
cursor = connection.cursor()
try:
    for filename in sorted(os.listdir(CSV_DIR)):
        cursor.execute("""LOAD DATA LOCAL INFILE %s
                          INTO TABLE new_table
                          FIELDS
                              TERMINATED BY ','
                              OPTIONALLY ENCLOSED BY '"'
                              ESCAPED BY '"'
                          LINES TERMINATED BY '\n'
                          IGNORE 1 LINES""",
                       (os.path.join(CSV_DIR, filename),))
    connection.commit()
finally:
    cursor.close()
NOTE: I set the local_infile parameter to 1 in MySQLdb.connect and pass the filename as a tuple to execute.
Works for me.