I'm using pyodbc to connect to SQL Server and fetch rows. The SELECT query I execute returns almost 200,000 rows, which causes a memory issue.
To work around this I'm using a generator that fetches 5,000 rows at a time.
The problem with this approach is that, with the generator, I lose the column names.
For example, if table1 has a column NAME, with normal execution I can access it as result.NAME,
but I can't do the same with the generator object; it doesn't let me access values by column name.
Any suggestions would be appreciated.
Using Cursor.fetchmany() to process the query result in batches returns a list of pyodbc.Row objects, and each Row can be referenced by column name.
Take these examples of a SQL Server query that returns database names in batches of 5:
Without generator
import pyodbc

connection = pyodbc.connect(driver='{SQL Server Native Client 11.0}',
                            server='localhost', database='master',
                            trusted_connection='yes')
sql = 'select name from sys.databases'
cursor = connection.cursor().execute(sql)
while True:
    # fetch the next batch of up to 5 rows
    rows = cursor.fetchmany(5)
    if not rows:
        break
    for row in rows:
        print(row.name)
With generator (modified from sample here)
import pyodbc

def rows(cursor, size=5):
    # yield rows one at a time while fetching them in batches of `size`
    while True:
        rows = cursor.fetchmany(size)
        if not rows:
            break
        for row in rows:
            yield row

connection = pyodbc.connect(driver='{SQL Server Native Client 11.0}',
                            server='localhost', database='master',
                            trusted_connection='yes')
sql = 'select name from sys.databases'
cursor = connection.cursor().execute(sql)
for row in rows(cursor):
    print(row.name)
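If a driver's row type does not expose columns as attributes, the same generator idea can pair each row with the names found in cursor.description. The following is only a minimal sketch under stated assumptions: it uses the standard-library sqlite3 module and a made-up table1 so that it runs without a SQL Server instance, and dict_rows is a hypothetical helper name, not part of pyodbc.

```python
import sqlite3


def dict_rows(cursor, size=5000):
    """Yield one dict per row, fetching `size` rows at a time."""
    # cursor.description holds one 7-item sequence per column; item 0 is the name.
    names = [col[0] for col in cursor.description]
    while True:
        batch = cursor.fetchmany(size)
        if not batch:
            break
        for row in batch:
            yield dict(zip(names, row))


# Stand-in database (assumption: sqlite3 replaces pyodbc for this demo).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE table1 (NAME TEXT)")
connection.executemany("INSERT INTO table1 (NAME) VALUES (?)",
                       [("row%d" % i,) for i in range(12)])

cursor = connection.cursor()
cursor.execute("SELECT NAME FROM table1 ORDER BY rowid")
names_seen = [row["NAME"] for row in dict_rows(cursor, size=5)]
print(names_seen[:3])
```

Only one batch of rows is held in memory at a time, and every row can still be addressed by column name.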
I am trying to update a SQL table with updated information held in a pandas DataFrame.
I have about 100,000 rows to iterate through and it's taking a long time. Is there any way to make this code more efficient? Do I even need to truncate the data? Most rows will probably be the same.
conn = pyodbc.connect("Driver={xxx};"
                      "Server=xxx;"
                      "Database=xxx;"
                      "Trusted_Connection=yes;")
cursor = conn.cursor()
# T-SQL requires the TABLE keyword here
cursor.execute('TRUNCATE TABLE dbo.Sheet1$')
for index, row in df_union.iterrows():
    print(row)
    cursor.execute("INSERT INTO dbo.Sheet1$ (Vendor, Plant) values(?,?)", row.Vendor, row.Plant)
Update: This is what I ended up doing.
import urllib.parse

import pandas as pd
from sqlalchemy import create_engine

params = urllib.parse.quote_plus(r'DRIVER={xxx};SERVER=xxx;DATABASE=xxx;Trusted_Connection=yes')
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = create_engine(conn_str)
df = pd.read_excel('xxx.xlsx')
print("loaded")
# replace the table, sending 1000 rows per multi-row INSERT
df.to_sql(name='tablename', schema='dbo', con=engine, if_exists='replace', index=False, chunksize=1000, method='multi')
Don't use a for loop or cursors; just use SQL:
insert into TABLENAMEA (A,B,C,D)
select A,B,C,D from TABLENAMEB
Take a look at this link for another demo:
https://www.sqlservertutorial.net/sql-server-basics/sql-server-insert-into-select/
You just need to update this part to run a normal insert
conn = pyodbc.connect("Driver={xxx};"
                      "Server=xxx;"
                      "Database=xxx;"
                      "Trusted_Connection=yes;")
cursor = conn.cursor()
cursor.execute('insert into TABLENAMEA (A,B,C,D) select A,B,C,D from TABLENAMEB')
You don't need to store the dataset in a variable. Just run the query directly as normal SQL; performance will be better than with row-by-row iteration.
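The set-based insert can be tried without a SQL Server instance. This is a rough sketch that assumes the standard-library sqlite3 module as a stand-in for pyodbc; the table definitions and the 1000 sample rows are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TABLENAMEB (A INTEGER, B TEXT, C TEXT, D TEXT)")
conn.execute("CREATE TABLE TABLENAMEA (A INTEGER, B TEXT, C TEXT, D TEXT)")
conn.executemany("INSERT INTO TABLENAMEB (A, B, C, D) VALUES (?, ?, ?, ?)",
                 [(i, "b", "c", "d") for i in range(1000)])

# One statement copies every row inside the database engine;
# no rows travel through a Python loop.
conn.execute("INSERT INTO TABLENAMEA (A, B, C, D) "
             "SELECT A, B, C, D FROM TABLENAMEB")
conn.commit()

copied = conn.execute("SELECT COUNT(*) FROM TABLENAMEA").fetchone()[0]
print(copied)
```

This only helps when the source rows already live in a table on the server; data held in a DataFrame would first have to be loaded into a staging table.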
My Python code works to this point and returns several rows. I need to take each row and process it in a loop in Python. The first row works fine and does its trick, but the second row never runs. Clearly, I am not looping correctly. I believe I am not iterating over each row in the results. Here is the code:
for row in results:
    print(row[0])
F:\FinancialResearch\SEC\myEdgar\sec-edgar-filings\A\10-K\0000014693-21-000091\full-submission.txt
F:\FinancialResearch\SEC\myEdgar\sec-edgar-filings\A\10-K\0000894189-21-001890\full-submission.txt
F:\FinancialResearch\SEC\myEdgar\sec-edgar-filings\A\10-K\0000894189-21-001895\full-submission.txt
for row in results:
    with open(row[0],'r') as f:
        contents = f.read()
    bill = row
    for x in range(0, 3):
        VanHalen = 'Hello'
        cnxn1 = pyodbc.connect('Driver={SQL Server};'
                               'Server=XXX;'
                               'Database=00010KData;'
                               'Trusted_Connection=yes;')
        curs1 = cnxn1.cursor()
        curs1.execute('''
            Update EdgarComments SET Comments7 = ? WHERE FullPath = ?
            ''', (VanHalen,bill))
        curs1.commit()
        curs1.close()
        cnxn1.close()
        print(x)
Error: ('HY004', '[HY004] [Microsoft][ODBC SQL Server Driver]Invalid SQL data type (0) (SQLBindParameter)')
The bill variable that you bind to the FullPath parameter holds the entire row object, not just the file path. Is this what you want?
Given the column name FullPath, I would normally expect the file path (row[0]) to be passed there.
Since this is an invalid-type error on the bound parameters, check the type of the bill variable before executing the statement and make sure it is a type the SQL driver accepts. You will usually want to convert unusual types to strings before using them as bind parameters.
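The difference between binding the whole row and binding row[0] can be reproduced with any DB-API driver. The sketch below assumes the standard-library sqlite3 module instead of pyodbc, with invented file names, so the exact exception differs from HY004, but the cause is the same: a row is not a scalar value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EdgarComments (FullPath TEXT, Comments7 TEXT)")
conn.executemany("INSERT INTO EdgarComments (FullPath) VALUES (?)",
                 [("a.txt",), ("b.txt",), ("c.txt",)])

results = conn.execute("SELECT FullPath FROM EdgarComments").fetchall()
update_sql = "UPDATE EdgarComments SET Comments7 = ? WHERE FullPath = ?"

# Binding the whole row (a tuple here) is rejected by the driver.
try:
    conn.execute(update_sql, ("Hello", results[0]))
    whole_row_failed = False
except sqlite3.Error:
    whole_row_failed = True

# Binding the scalar path works; one connection serves every row.
for row in results:
    full_path = row[0]
    conn.execute(update_sql, ("Hello", full_path))
conn.commit()

updated = conn.execute(
    "SELECT COUNT(*) FROM EdgarComments WHERE Comments7 = 'Hello'"
).fetchone()[0]
print(whole_row_failed, updated)
```

Opening the connection once, outside the loop, also avoids reconnecting for every row.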
I am trying to update and insert rows in a large Postgres table of approximately 30 million records using psycopg2 in Python. I do this in batches of 100K records (one batch takes about 6 minutes) because I don't want to keep a transaction open for too long: the table rows are used by other transactions while I am writing them, and I want to avoid row locking. I open and close the connection and the cursor every time in the loop.
To update/insert via cursors in Postgres, which (if either) of the options below is preferable to avoid locks and also get better performance?
1> Opening the connection as well as closing the connection with every batch.
open the connection for the batch.
open the cursor for the batch.
commit the transaction for the batch.
close the cursor for the batch.
close the connection for the batch.
2> Opening and closing the connection once but opening the cursor alone every time with every batch.
open the connection once.
open the cursor for the batch.
commit the transaction for the batch.
close the cursor for the batch.
close the connection after the last batch.
Please advise if there is a better option as well. Currently I am using cursor.execute to run the insert/update queries, but because performance is not very fast I have to use batching. Since I don't have sufficient permissions to drop the indexes during the insert, batching is the route I am taking.
Queries used :-
Update:-
UPDATE target_tbl tgt
set descr = stage.descr,
    prod_name = stage.prod_name,
    item_name = stage.item_name,
    url = stage.url,
    col1_name = stage.col1_name,
    col2_name = stage.col2_name,
    col3_name = stage.col3_name,
    col4_name = stage.col4_name,
    col5_name = stage.col5_name,
    col6_name = stage.col6_name,
    col7_name = stage.col7_name,
    col8_name = stage.col8_name,
    flag = stage.flag
from tbl1 stage
where
    tgt.col1 = stage.col1
    and tgt.col2 = stage.col2
    and coalesce(tgt.col3, 'col3'::text) = coalesce(stage.col3, 'col3'::text)
    and coalesce(tgt.col4, 'col4'::text) = coalesce(stage.col4, 'col4'::text);
Insert:-
Insert into tgt
select
    stage.col1,
    stage.col2,
    stage.col3,
    stage.col4,
    stage.prod_name,
    stage.item_name,
    stage.url,
    stage.col1_name,
    stage.col2_name,
    stage.col3_name,
    stage.col4_name,
    stage.col5_name,
    stage.col6_name,
    stage.col7_name,
    stage.col8_name,
    stage.flag
from tbl1 stage
where NOT EXISTS (
    select from tgt where
        tgt.col1 = stage.col1
        and tgt.col2 = stage.col2
        and coalesce(tgt.col3, 'col3'::text) = coalesce(stage.col3, 'col3'::text)
        and coalesce(tgt.col4, 'col4'::text) = coalesce(stage.col4, 'col4'::text)
) ;
Here is an upsert function that requires no hand-written SQL on the Python side; it relies only on SQLAlchemy.
You can also choose the batch size. I used a batch size of 1000, but you can experiment with larger values.
Consider this function if you have a list of dictionaries and a SQL table that already has the same column names and types. If you are using a DataFrame, you can simply call df.to_dict('records') to obtain a list of dictionaries that is ready for input.
Make sure that your dict keys and table columns match.
from typing import Dict, List

import pandas as pd
from sqlalchemy import MetaData, Table, create_engine
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.engine.base import Engine as sql_engine

engine = create_engine(...)

def upsert_database(list_input: List[Dict], engine: sql_engine, table: str, schema: str) -> None:
    if len(list_input) == 0:
        return None
    # Reflect the existing table so the statement uses its real columns
    # (autoload_with works on both SQLAlchemy 1.4 and 2.0).
    target_table = Table(table, MetaData(), autoload_with=engine, schema=schema)
    chunks = [list_input[i:i + 1000] for i in range(0, len(list_input), 1000)]
    for chunk in chunks:
        stmt = insert(target_table).values(chunk)
        # On conflict, overwrite every non-key column with the incoming value.
        update_dict = {c.name: c for c in stmt.excluded if not c.primary_key}
        # engine.begin() commits each chunk when its block exits without error.
        with engine.begin() as conn:
            # Assumes the primary-key constraint has the default name <table>_pkey.
            conn.execute(stmt.on_conflict_do_update(
                constraint=f'{table}_pkey',
                set_=update_dict)
            )
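The batching pattern itself, one connection reused for every batch and one short transaction per batch, can be sketched independently of PostgreSQL. The example below assumes the standard-library sqlite3 module as a stand-in; chunked is a hypothetical helper name, and the two-column target_tbl and its 2500 records are made up for illustration.

```python
import sqlite3
from typing import Dict, Iterator, List


def chunked(records: List[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `records` holding at most `size` items."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


records = [{"id": i, "descr": "item %d" % i} for i in range(2500)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target_tbl (id INTEGER PRIMARY KEY, descr TEXT)")

batch_sizes = []
for batch in chunked(records, 1000):
    # `with conn` commits on success and rolls back on error,
    # so each batch is its own short transaction.
    with conn:
        conn.executemany(
            "INSERT INTO target_tbl (id, descr) VALUES (:id, :descr)", batch)
    batch_sizes.append(len(batch))

total = conn.execute("SELECT COUNT(*) FROM target_tbl").fetchone()[0]
print(batch_sizes, total)
```

Committing after each batch releases that batch's row locks before the next one starts, while the connection is opened only once.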
I'm trying to read data from an Oracle database.
In Python, I have to read the results of a simple SELECT that returns a million rows.
I use the fetchall() function, changing the arraysize property of the cursor.
select_qry = db_functions.read_sql_file('src/data/scripts/03_perimetro_select.sql')
dsn_tns = cx_Oracle.makedsn(ip, port, sid)
con = cx_Oracle.connect(user, pwd, dsn_tns)
start = time.time()
cur = con.cursor()
cur.arraysize = 1000
cur.execute('select * from bigtable where rownum < 10000')
res = cur.fetchall()
# print res # uncomment to display the query results
elapsed = (time.time() - start)
print(elapsed, " seconds")
cur.close()
con.close()
If I remove the condition where rownum < 10000, the Python environment freezes and the fetchall() call never returns.
After some trials I found a limit for this particular SELECT: it works up to 50k rows, but it fails if I select 60k rows.
What is causing this problem? Do I have to find another way to fetch this amount of data, or is the problem the ODBC connection? How can I test it?
Consider running in batches using Oracle's ROWNUM. To combine the batches back into a single object, append to a growing list. The code below assumes the total row count of the table is 1 million. Adjust as needed:
table_row_count = 1000000
batch_size = 10000
# PREPARED STATEMENT
# (Oracle table aliases take no AS, "*" must be qualified when combined
# with another column, and the statement has no trailing semicolon)
sql = """SELECT t.* FROM
           (SELECT sub_t.*, ROWNUM AS row_num
            FROM
              (SELECT * FROM bigtable ORDER BY primary_id) sub_t
           ) t
         WHERE t.row_num BETWEEN :LOWER_BOUND AND :UPPER_BOUND"""
data = []
# ROWNUM starts at 1, so the first batch covers rows 1..batch_size
for lower_bound in range(1, table_row_count + 1, batch_size):
    # BIND PARAMS WITH BOUND LIMITS
    cur.execute(sql, {'LOWER_BOUND': lower_bound,
                      'UPPER_BOUND': lower_bound + batch_size - 1})
    for row in cur.fetchall():
        data.append(row)
You are probably running out of memory on the computer running cx_Oracle. Don't use fetchall(), because it requires cx_Oracle to hold all the results in memory. Use something like this to fetch batches of records:
cursor = connection.cursor()
cursor.execute("select employee_id from employees")
res = cursor.fetchmany(numRows=3)
print(res)
res = cursor.fetchmany(numRows=3)
print(res)
Stick the fetchmany() calls in a loop, process each batch of rows in your app before fetching the next set of rows, and exit the loop when there is no more data.
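The loop described above might look like the following. It is only a sketch: the standard-library sqlite3 module stands in for cx_Oracle so the code runs anywhere, and the ten employee rows are invented. Like cx_Oracle, sqlite3 uses cursor.arraysize as the batch size when fetchmany() is called without an argument.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE employees (employee_id INTEGER)")
connection.executemany("INSERT INTO employees (employee_id) VALUES (?)",
                       [(i,) for i in range(1, 11)])

cursor = connection.cursor()
cursor.arraysize = 3           # default batch size used by fetchmany()
cursor.execute("SELECT employee_id FROM employees ORDER BY employee_id")

batches = 0
id_sum = 0
while True:
    rows = cursor.fetchmany()  # at most cursor.arraysize rows
    if not rows:
        break                  # no more data
    batches += 1
    # Process the batch before fetching the next one.
    id_sum += sum(row[0] for row in rows)

print(batches, id_sum)
```

Because each batch is processed and then discarded, memory use stays bounded by the batch size rather than by the size of the result set.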
Whatever solution you use, tune cursor.arraysize to get the best performance.
The suggestion already given, to repeat the query and select subsets of rows, is also worth considering. If you are using Oracle Database 12c or later, there is a newer (easier) syntax such as SELECT * FROM mytab ORDER BY id OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY.
PS cx_Oracle does not use ODBC.
I have data in a pandas DataFrame, which I am storing in a SQLite database using Python. When I query the tables in it, I get the results but without the column names. Can someone please guide me?
sql_query = """Select date(report_date), insertion_order_id, sum(impressions), sum(clicks), (sum(clicks)+0.0)/sum(impressions)*100 as CTR
               from RawDailySummaries
               Group By report_date, insertion_order_id
               Having report_date like '2014-08-12%' """
cursor.execute(sql_query)
query1 = cursor.fetchall()
for i in query1:
    print(i)
Below is the output that I get
(u'2014-08-12', 10187, 2024, 8, 0.3952569169960474)
(u'2014-08-12', 12419, 15054, 176, 1.1691244851866613)
What do I need to do to display the results in tabular form with column names?
In DB-API 2.0 compliant clients, cursor.description is a sequence of 7-item sequences of the form (<name>, <type_code>, <display_size>, <internal_size>, <precision>, <scale>, <null_ok>), one for each column, as described here. Note that description will be None if the executed statement does not return rows (for example, an INSERT) or if nothing has been executed on the cursor yet.
If you want to create a list of the column names, you can use a list comprehension like this: column_names = [i[0] for i in cursor.description] and then do with them whatever you'd like.
Alternatively, you can set the row_factory attribute of the connection object to something that provides column names with the results. An example of a dictionary-based row factory for SQLite is found here, and you can see a discussion of the sqlite3.Row type below that.
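Both approaches can be combined in a few lines. This is a small sketch using the standard-library sqlite3 module; the table follows the question's RawDailySummaries, but the column types and the two sample rows are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row   # rows now support access by column name
conn.execute("CREATE TABLE RawDailySummaries "
             "(report_date TEXT, insertion_order_id INTEGER, "
             "impressions INTEGER, clicks INTEGER)")
conn.executemany("INSERT INTO RawDailySummaries VALUES (?, ?, ?, ?)",
                 [("2014-08-12", 10187, 1000, 5),
                  ("2014-08-12", 10187, 1000, 5)])

cursor = conn.execute(
    "SELECT insertion_order_id, SUM(impressions) AS impressions, "
    "SUM(clicks) AS clicks FROM RawDailySummaries "
    "GROUP BY insertion_order_id")
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

# Print a header line followed by the data rows.
print("\t".join(column_names))
for row in rows:
    print("\t".join(str(row[name]) for name in column_names))
```

Giving each aggregate an alias (AS impressions, AS clicks) keeps the names in cursor.description short and predictable.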
Step 1: Choose your database library, e.g. pyodbc or SQLAlchemy.
Step 2: Establish a connection and create a cursor
cursor = connection.cursor()
Step 3: Execute the SQL statement
cursor.execute("Select * from db.table where condition=1")
Step 4: Extract the headers from the cursor's description attribute
headers = [i[0] for i in cursor.description]
print(headers)
Try pandas' read_sql(). I can't check it right now, but it should be something like this, where Q is the query string:
pd.read_sql( Q , connection)
Here is sample code using cx_Oracle that should do what is expected:
import cx_Oracle

def test_oracle():
    connection = cx_Oracle.connect('user', 'password', 'tns')
    cursor = connection.cursor()
    try:
        cursor.execute('SELECT day_no,area_code ,start_date from dic.b_td_m_area where rownum<10')
        # print only the header (column names)
        title = [i[0] for i in cursor.description]
        print(title)
        # full column info
        for x in cursor.description:
            print(x)
    finally:
        cursor.close()
        connection.close()

if __name__ == "__main__":
    test_oracle()