python-2.7.15, pymssql-2.1.4, SQL_Server-2018, Windows 10 Pro, MS-Office-2016
import time
import csv
import pymssql
db_settings = {
"host" : "127.0.0.1",
"port" : "1433",
"user" : "sa",
"password" : "********",
"database" : "testdb",
"charset" : "utf8"
}
conn = pymssql.connect(**db_settings)
cursor = conn.cursor()
ff = csv.reader(open('base.csv', 'r'))
sql = """
BEGIN
INSERT INTO Base([name], [year], [update], [status],
[timeline], [language], [pic]) VALUES (%s, %s, %s, %s, %s, %s, %s)
END
"""
now = time.strftime("%M:%S")
t = []
for i in ff:
    i = i[1:]
    if "year" in i:
        pass
    else:
        t.append((i[0], i[1], i[3], i[4], i[6], i[5], i[8]))
cursor.executemany(sql, t)
conn.commit()
end = time.strftime("%M:%S")
print(now + "," + end)
The base.csv file is 21.7 MB and has 30,374 rows. When I execute the code above, it takes 929 seconds to complete, which is only about 32.7 rows per second. That is far too slow. Can anyone help me find out the reason? Thanks a lot. :-)
I reduced the executemany time in pymssql from 30 minutes to 30 seconds like this.
In SQL you can write an INSERT statement that inserts multiple rows at once. It looks like this:
INSERT INTO table_name (col_name1, col_name2)
VALUES
(row1_val1, row1_val2),
(row2_val1, row2_val2),
...
(row1000_val1, row1000_val2)
I implemented an insert function that takes the data in chunks and rewrites the query so that one execute call inserts many rows at once.
def insert(query, data, chunk=999):
    # chunk=999 keeps each statement under SQL Server's limit of 1000 rows per VALUES clause
    conn = get_connection()
    cursor = conn.cursor()
    query = query.lower()
    insert_q, values_q = query.split('values')  # split into the INSERT part and the values template
    insert_q += 'values'  # re-append the keyword removed by the split
    for chunk_data in chunks(data, chunk):
        # chunk_data is a list of row-parameter tuples
        flat_list = [item for sublist in chunk_data for item in sublist]  # flatten so we can use execute instead of executemany
        chunk_query = insert_q + ','.join([values_q] * len(chunk_data))  # repeat the values template once per row
        cursor.execute(chunk_query, tuple(flat_list))
    conn.commit()
chunks can be implemented like this (thanks to one of the great replies on this forum):
def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
Example usage:
insert('INSERT INTO users (user_id, name, surname) VALUES (%s, %s, %s)',
       [(1, 'Jack', 'Kcaj'), (2, 'Andrew', 'Golara')])
Related timings:
Using execute: 40 inserts per minute
Using executemany: 41 inserts per minute
Using extras.execute_values: 42 inserts per minute
import datetime
from typing import Any

import psycopg2
from psycopg2 import extras

def save_return_to_postgres(record_to_insert) -> Any:
    insert_query = """INSERT INTO pricing.xxxx (description,code,unit,price,created_date,updated_date)
                      VALUES %s returning id"""
    records = (record_to_insert[2], record_to_insert[1], record_to_insert[3],
               record_to_insert[4], record_to_insert[0], datetime.datetime.now())
    # df = df[["description","code","unit","price","created_date","updated_date"]]
    try:
        conn = psycopg2.connect(database='xxxx',
                                user='xxxx',
                                password='xxxxx',
                                host='xxxx',
                                port='xxxx',
                                connect_timeout=10)
        print("Connection Opened with Postgres")
        cursor = conn.cursor()
        extras.execute_values(cursor, insert_query, [records])
        conn.commit()
        # print(record_to_insert)
    finally:
        if conn:
            cursor.close()
            conn.close()
            print("Connection to postgres was successfully closed")

valores = df.values
for valor in valores:
    save_return_to_postgres(valor)
    print(valor)
I don't know how many rows per INSERT Postgres can take,
but many SQL-based databases accept multiple rows in a single INSERT statement.
So instead of running
for insert_query in queries:
    sql_execute(insert_query)
try making several inserts at once in a single command
(test it in pure SQL first to see if it works).
insert_list = []
for insert_query in queries:
    insert_list.append(insert_query)
sql_execute(insert_list)
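For example, with psycopg2 a single multi-row INSERT can be built like this (a rough sketch; the column list and rows below are hypothetical stand-ins for your data):

import psycopg2

conn = psycopg2.connect(database='xxxx', user='xxxx', password='xxxxx', host='xxxx')
cursor = conn.cursor()
rows = [('desc1', 'code1', 'unit1', 9.99), ('desc2', 'code2', 'unit2', 19.99)]  # hypothetical data
row_template = "(%s, %s, %s, %s)"
query = ("INSERT INTO pricing.xxxx (description, code, unit, price) VALUES "
         + ", ".join([row_template] * len(rows)))
params = [value for row in rows for value in row]  # flatten so each placeholder gets one value
cursor.execute(query, params)
conn.commit()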
I had a similar issue and this link helped me:
https://www.sqlservertutorial.net/sql-server-basics/sql-server-insert-multiple-rows/
(of course mine was not Postgres, but the idea is the same:
cut down the network time by running multiple inserts in one command)
We're in this together!
Use execute_batch or execute_values and use them over the entire record set. As of now you are not using the batch capabilities of execute_values because you are inserting a single record at a time. You are further slowing things down by opening and closing a connection for each record as that is a time/resource expensive operation. Below is untested as I don't have the actual data and am assuming what df.values is.
insert_query = """INSERT INTO pricing.xxxx (description,code,unit,price,created_date,updated_date)
                  VALUES %s returning id"""

# execute_batch query:
# insert_query = """INSERT INTO pricing.xxxx (description,code,unit,price,created_date,updated_date)
#                   VALUES (%s, %s, %s, %s, %s, %s) returning id"""

valores = df.values

# Create a list of lists to pass to the query as a batch instead of singly.
records = [[record_to_insert[2], record_to_insert[1], record_to_insert[3],
            record_to_insert[4], record_to_insert[0], datetime.datetime.now()]
           for record_to_insert in valores]

try:
    conn = psycopg2.connect(database='xxxx',
                            user='xxxx',
                            password='xxxxx',
                            host='xxxx',
                            port='xxxx',
                            connect_timeout=10)
    print("Connection Opened with Postgres")
    cursor = conn.cursor()
    extras.execute_values(cursor, insert_query, records)
    # execute_batch:
    # extras.execute_batch(cursor, insert_query, records)
    conn.commit()
    # print(record_to_insert)
finally:
    if conn:
        cursor.close()
        conn.close()
        print("Connection to postgres was successfully closed")
For more information see Fast execution helpers. Note that both the execute_values and execute_batch functions have a page_size argument with a default value of 100. This is the batch size for the operations. For large data sets you can reduce the time further by increasing page_size to make bigger batches and reduce the number of server round trips.
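For example, a small tweak to the call above (same names as before):

# page_size=1000 sends larger batches per round trip than the default of 100
extras.execute_values(cursor, insert_query, records, page_size=1000)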
I'm quite new to Python (Python 3.4.6) :)
I'm trying to insert some rows into a MySQL database, but with a variable set of columns.
At the beginning, I have a dictionary, list_hosts.
Here is my code:
import mysql.connector
import time

db = mysql.connector.connect(host='localhost', user='xxxxx', passwd='xxxxx', database='xxxxx')
cursor = db.cursor()
now_db = time.strftime('%Y-%m-%d %H:%M:%S')
key_db = ""
value_ex = ""
value_db = ""
for key, value in list_hosts.items():
    key_db += key + ", "
    value_ex += "%s, "
    value_db += "\"" + value + "\", "
key_db = key_db.strip(" ")
key_db = key_db.strip(",")
value_ex = value_ex.strip(" ")
value_ex = value_ex.strip(",")
value_db = value_db.strip(" ")
value_db = value_db.strip(",")
add_host = ("INSERT INTO nagios_hosts (date_created, date_modified, " + key_db + ") VALUES (" + value_ex + ")")
data_host = ("\"" + now_db + "\", \"" + now_db + "\", " + value_db)
cursor.execute(add_host, data_host)
db.commit()
db.close()
Example of list_hosts:
OrderedDict([('xxxx1', 'data1'), ('xxxx2', 'data2'), ('xxxx3', 'data3'), ('xxxx4', 'data4'), ('xxxx5', 'data5'), ('xxxx6', 'data6')])
I've simplified the code, of course.
I did it like this because I never have the same number of items in the dictionary.
I'm trying to create something like this:
add_host - INSERT INTO TABLE (date_created, date_modified, xxxx1, xxxx2, xxxx3, xxxx4, xxxx5, xxxx6) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
data_host - now, now, data1, data2, data3, data4, data5, data6
There is never the same number of xxxx columns.
They all exist in the DB, but I don't need to fill every column for every item in the dictionary.
When I execute, I get this error:
mysql.connector.errors.ProgrammingError: 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'xxxxxxxxxxxxxxxxxxxxx' at line 1
As I'm just getting started with Python, I'm sure there's plenty that can be cleaned up too... don't hesitate :)
Here's a canonical python3 (python2 compatible) solution:
import time
from collections import OrderedDict
list_hosts = OrderedDict([("field1", "value1"), ("field2", "value2"), ("fieldN", "valueN")])
# builds a comma-separated string of db placeholders for the values:
placeholders = ", ".join(["%s"] * (len(list_hosts) + 2))
# builds a comma-separated string of field names
fields = ", ".join(("date_created","date_modified") + tuple(list_hosts.keys()))
# builds a tuple of values including the dates
now_db = time.strftime('%Y-%m-%d %H:%M:%S')
values = (now_db, now_db) + tuple(list_hosts.values())
# build the SQL query:
sql = "INSERT INTO nagio_hosts({}) VALUES({})".format(fields, placeholders)
# and safely execute it
cursor.execute(sql, values)
db.commit()
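For reference, with the example list_hosts above the generated statement comes out as:

print(sql)
# INSERT INTO nagios_hosts(date_created, date_modified, field1, field2, fieldN) VALUES(%s, %s, %s, %s, %s)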
As @khelwood mentioned in the comments, you should use parameterized queries.
If the number of columns you're inserting varies, you can generate the column list and a tuple of values and use them in a parameterized query.
cursor.execute() accepts two parameters:
a query as a string;
parameters as a tuple.
The idea is to generate the string and the tuple and pass those to cursor.execute().
You'll need something like this:
list_hosts = {'xxxx1': 'data1', 'xxxx2': 'data2', 'xxxx3': 'data3', 'xxxx4': 'data4'}

keys = []    # list of column names
values = ()  # tuple of values
for key, value in list_hosts.items():
    keys.append(key)
    values = values + (value,)

keys_str = ', '.join(keys)
ps = ', '.join(['%s'] * len(list_hosts))
query = "INSERT INTO tbl (%s) VALUES (%s)" % (keys_str, ps)
print(query)
# INSERT INTO tbl (xxxx1, xxxx2, xxxx3, xxxx4) VALUES (%s, %s, %s, %s)
cursor.execute(query, values)
Just tried it on a sample data, works fine!
I'm inserting or updating around 3 to 4 million rows in PostgreSQL using a Python script. Please see the code below. The requirement is: insert if the key is new, or, if the key already exists, append the new value to its existing value. The code below makes too many round trips to the database and takes around 35-45 minutes to insert 3 million records, which is very slow. How can I avoid the round trips and insert or update faster?
Any help would be really appreciated.
Thanks for your help in advance.
InputFile.txt - this file has around 3 to 4 million line items
productKey1 printer1,printerModel1,printerPrice1,printerDesc1|
productKey2 sacnner2,scannerModel2,scannerPrice2,scannerDesc2|
productKey3 mobile3,mobileModel3,mobilePrice3,mobileDesc3|
productKey4 tv4,tvModel4,tvPrice4,tvDescription4|
productKey2 sacnner22,scannerModel22,scannerPrice22,scannerDesc22|
insert.py
import psycopg2

def insertProduct(filename, conn):
    seen = set()
    cursor = conn.cursor()
    qi = "INSERT INTO productTable (key, value) VALUES (%s, %s);"
    qu = "UPDATE productTable SET value = CONCAT(value, %s) WHERE key = %s;"
    with open(filename) as f:
        for line in f:
            if line.strip():
                key, value = line.split(' ', 1)
                if key not in seen:
                    seen.add(key)
                    cursor.execute(qi, (key, value))
                else:
                    cursor.execute(qu, (value, key))
    conn.commit()

conn = psycopg2.connect("dbname='productDB' user='myuser' host='localhost'")
insertProduct('InputFile.txt', conn)
Execute batches of prepared statements. http://initd.org/psycopg/docs/extras.html#fast-execution-helpers
import psycopg2, psycopg2.extras

def insertProduct(filename, conn):
    data = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line:
                key, value = line.split(' ', 1)
                data.append((key, value))
    cursor = conn.cursor()
    cursor.execute("""
        prepare upsert (text, text) as
        with i as (
            insert into productTable (key, value)
            select $1, $2
            where not exists (select 1 from productTable where key = $1)
            returning *
        )
        update productTable p
        set value = concat(p.value, $2)
        where p.key = $1 and not exists (select 1 from i)
    """)
    psycopg2.extras.execute_batch(cursor, "execute upsert (%s, %s)", data, page_size=500)
    cursor.execute("deallocate upsert")
    conn.commit()

conn = psycopg2.connect(database='cpn')
insertProduct('InputFile.txt', conn)
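For what it's worth, on PostgreSQL 9.5+ the same upsert can also be expressed with INSERT ... ON CONFLICT instead of a prepared CTE. A minimal sketch, assuming productTable has a unique constraint on key (duplicate keys within one batch have to be collapsed first, otherwise ON CONFLICT DO UPDATE rejects the statement):

import psycopg2, psycopg2.extras
from collections import OrderedDict

def upsertProduct(filename, conn):
    # Collapse duplicate keys from the file so each key appears only once per statement.
    merged = OrderedDict()
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line:
                key, value = line.split(' ', 1)
                merged[key] = merged.get(key, '') + value
    cursor = conn.cursor()
    psycopg2.extras.execute_values(cursor, """
        insert into productTable (key, value) values %s
        on conflict (key) do update
        set value = concat(productTable.value, excluded.value)
    """, list(merged.items()), page_size=500)
    conn.commit()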
I have a folder called 'testfolder' that contains two files, 'Sigurdlogfile' and '2004ADlogfile'. Each file has a list of strings called entries. I need to run my code on both of them and am using glob to do this. My code creates a dictionary for each file and stores data extracted using regex, where the dictionary keys are listed in commonterms below. It then inserts each dictionary into a MySQL table. It does all of this successfully, but my second SQL statement is not inserting the way it should (per file).
import glob
import re
files = glob.glob('/home/user/testfolder/*logfile*')
commonterms = (["freq", "\s?(\d+e?\d*)\s?"],
["tx", "#txpattern"],
["rx", "#rxpattern"], ...)
terms = [commonterms[i][0] for i in range(len(commonterms))]
patterns = [commonterms[i][1] for i in range(len(commonterms))]
def getTerms(entry):
    for i in range(len(terms)):
        term = re.search(patterns[i], entry)
        if term:
            term = term.groups()[0] if term.groups()[0] is not None else term.groups()[1]
        else:
            term = 'NULL'
        d[terms[i]] += [term]
    return d
for filename in files:
    # code to create 'entries'
    objkey = re.match(r'/home/user/testfolder/(.+?)logfile', filename).group(1)
    d = {t: [] for t in terms}
    for entry in entries:
        d = getTerms(entry)

    import MySQLdb
    db = MySQLdb.connect(host='', user='', passwd='', db='')
    cursor = db.cursor()
    cols = d.keys()
    vals = d.values()
    for i in range(len(entries)):
        lst = [item[i] for item in vals]
        csv = "'{}'".format("','".join(lst))
        sql1 = "INSERT INTO table (%s) VALUES (%s);" % (','.join(cols), csv.replace("'NULL'", "NULL"))
        cursor.execute(sql1)

    # now in my 2nd sql statement I need to update the table with data from an old table, which is where I have the problem...
    sql2 = ("UPDATE table, oldtable SET table.key1 = oldtable.key1, "
            "table.key2 = oldtable.key2 WHERE oldtable.obj = %s;" % repr(objkey))
    cursor.execute(sql2)
    db.commit()
    db.close()
The problem is that the second SQL statement ends up inserting data from only one of the objkeys into all columns of the table, but I need it to insert different data depending on which file the code is currently running on. I can't figure out why, since I've defined objkey inside my for filename in files loop. How can I fix this?
Instead of doing separate INSERT and UPDATE, do them together to incorporate the fields from the old table.
for i in range(len(entries)):
    lst = [item[i] for item in vals]
    csv = "'{}'".format("','".join(lst))
    sql1 = """INSERT INTO table (key1, key2, %s)
              SELECT o.key1, o.key2, a.*
              FROM (SELECT %s) AS a
              LEFT JOIN oldtable AS o ON o.obj = %s""" % (','.join(cols), csv.replace("'NULL'", "NULL"), repr(objkey))
    cursor.execute(sql1)
I am trying to take the data from a dictionary (the example is simplified for readability) and insert it into a mysql database.
I have the following piece of code.
import pymysql
conn = pymysql.connect(server, user , password, "db")
cur = conn.cursor()
ORFs={'E7': '562', 'E6': '83', 'E1': '865', 'E2': '2756 '}
table="genome"
cols = ORFs.keys()
vals = ORFs.values()
sql = "INSERT INTO %s (%s) VALUES(%s)" % (
table, ",".join(cols), ",".join(vals))
print sql
print ORFs.values()
cur.execute(sql, ORFs.values())
cur.close()
conn.close()
the print sql statement returns
INSERT INTO genome (E7,E6,E1,E2) VALUES(562,83,865,2756 )
When I type this directly into the MySQL command line, the command works. But when I run the Python script I get an error:
<type 'exceptions.TypeError'>: not all arguments converted during string formatting
args = ('not all arguments converted during string formatting',)
message = 'not all arguments converted during string formatting'
As always, any suggestions would be highly appreciated.
The previous answer doesn't work for non-string dictionary values. This is a revised version:
format_string = ','.join(['%s'] * len(dict))
self.db.set("""INSERT IGNORE INTO listings ({0}) VALUES ({1})""".format(", ".join(dict.keys()), format_string),
            (dict.values()))
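If you are not going through a wrapper like self.db.set, the same idea with the plain pymysql cursor from the question might look like this (a sketch):

format_string = ','.join(['%s'] * len(ORFs))
sql = "INSERT INTO genome ({0}) VALUES ({1})".format(", ".join(ORFs.keys()), format_string)
cur.execute(sql, list(ORFs.values()))  # the driver quotes the values itself
conn.commit()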
sql = "INSERT INTO %s (%s) VALUES(%s)" % (
table, ",".join(cols), ",".join(vals))
This SQL already includes the values, and cur.execute(sql, ORFs.values()) passes the values a second time as parameters.
So it should just be cur.execute(sql).
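Alternatively, keep %s placeholders for the values and let the driver do the quoting (a sketch reusing the question's cols and vals):

# Placeholders for the values only; the table and column names still have to be formatted in.
sql = "INSERT INTO %s (%s) VALUES (%s)" % (table, ",".join(cols), ",".join(["%s"] * len(vals)))
cur.execute(sql, list(vals))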
In my case, I skip null columns (only the keys present in data end up in the INSERT):
data = {'k': 'v'}
fs = ','.join(list(map(lambda x: '`' + x + '`', [*data.keys()])))
vs = ','.join(list(map(lambda x: '%(' + x + ')s', [*data.keys()])))
sql = "INSERT INTO `%s` (%s) VALUES (%s)" % (table, fs, vs)
count = cursor.execute(sql, data)
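For example, with the ORFs dictionary from the question and table = 'genome', sql comes out as:

INSERT INTO `genome` (`E7`,`E6`,`E1`,`E2`) VALUES (%(E7)s,%(E6)s,%(E1)s,%(E2)s)

and cursor.execute(sql, data) fills the %(name)s placeholders from the dict by key.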