I am trying to read a 100GB+ table in Python using the pymysql package.
The query I am running is
select * from table
But I want to process the records in chunks instead of pulling all 100 GB from the database at once. Below is my code:
with self.connection.cursor() as cursor:
    logging.info("Executing Read query")
    logging.info(cursor.mogrify(query))
    cursor.execute(query)
    schema = cursor.description
    size = cursor.rowcount
    for i in range((size//batch)+1):
        records = cursor.fetchmany(size=batch)
        yield records, schema
But when the query is executed at cursor.execute(query), it tries to fetch all 100 GB of records and the process ends up being killed.
Is there a better way to read data in chunks from MySQL using Python?
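One possible approach, sketched under assumptions: the default pymysql cursor buffers the whole result set on the client during execute(), whereas pymysql.cursors.SSCursor is unbuffered and streams rows as they are fetched. An unbuffered cursor cannot report rowcount up front, so the loop below stops when fetchmany returns an empty batch. The helper name iter_batches is hypothetical, and sqlite3 is used only as a stand-in so the sketch runs without a MySQL server.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of rows until the cursor is exhausted."""
    while True:
        records = cursor.fetchmany(batch_size)
        if not records:
            break
        yield records


# Stand-in database so the sketch runs anywhere. With pymysql the cursor
# would instead come from connection.cursor(pymysql.cursors.SSCursor).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?)", [(i,) for i in range(10)])

cur = conn.cursor()
cur.execute("SELECT id FROM demo ORDER BY id")
batch_sizes = [len(batch) for batch in iter_batches(cur, 4)]
print(batch_sizes)
conn.close()
```

Because the loop depends only on fetchmany, the same generator should work with any DB-API cursor. Whether memory stays flat depends on the cursor actually being unbuffered.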
Related
I am extracting millions of rows from SQL Server and inserting them into an Oracle database using Python. Each record takes about 1 second to insert into the Oracle table, so the load takes hours. What is the fastest way to load the data?
My code below:
def insert_data(conn, cursor, query, data, batch_size=10000):
    recs = []
    count = 1
    for rec in data:
        recs.append(rec)
        if count % batch_size == 0:
            cursor.executemany(query, recs, batcherrors=True)
            conn.commit()
            recs = []
        count = count + 1
    # insert whatever is left over after the last full batch
    if recs:
        cursor.executemany(query, recs, batcherrors=True)
        conn.commit()
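For comparison, the same batching can be written without the manual counter. This is only a sketch: batched is a hypothetical helper, sqlite3 stands in for the Oracle connection, and the cx_Oracle-specific batcherrors=True argument is left out because sqlite3 does not accept it.

```python
import sqlite3
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def insert_data(conn, cursor, query, data, batch_size=10000):
    """Insert data in batches, committing after each one."""
    total = 0
    for recs in batched(data, batch_size):
        cursor.executemany(query, recs)
        conn.commit()
        total += len(recs)
    return total


# Stand-in target table so the sketch is runnable.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE target (a INTEGER, b TEXT)")
rows = ((i, str(i)) for i in range(25))
inserted = insert_data(conn, cur, "INSERT INTO target VALUES (?, ?)", rows, batch_size=10)
count = cur.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(inserted, count)
```

The last, partial batch is handled by the same loop, so no separate executemany call is needed after it.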
Perhaps you cannot buy a third-party ETL tool, but you can certainly write a procedure in PL/SQL in the Oracle database.
First, install the Oracle Transparent Gateway for ODBC. No license cost involved.
Second, in the Oracle db, create a db link to reference the MSSQL database via the gateway.
Third, write a PL/SQL procedure to pull the data from the MSSQL database, via the db link.
I was once presented with a problem similar to yours. A developer was using SSIS to copy around a million rows from MSSQL to Oracle, and it was taking over 4 hours. I ran a trace on his process and saw that it was copying row-by-row, slow-by-slow. It took me less than 30 minutes to write a PL/SQL proc to copy the data, and it completed in less than 4 minutes.
I give a high-level view of the entire setup and process, here:
EDIT:
Thought you might like to see exactly how simple the actual procedure is:
create or replace procedure my_load_proc
as
begin
  insert into my_oracle_table (col_a,
                               col_b,
                               col_c)
  select sql_col_a,
         sql_col_b,
         sql_col_c
  from mssql_tbl@mssql_link;
end;
My actual procedure has more to it, dealing with run-time logging, emailing notification of completion, etc. But the above is the 'guts' of it, pulling the data from mssql into oracle.
Then you might want to use pandas, PySpark, or another big data framework available for Python.
There are a lot of examples out there. Here is one, adapted from the Microsoft Docs, showing how to load the data:
import pyodbc
import pandas as pd
import cx_Oracle  # driver used by the oracle+cx_oracle SQLAlchemy dialect
from sqlalchemy import create_engine
server = 'servername'
database = 'AdventureWorks'
username = 'yourusername'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
query = "SELECT [CountryRegionCode], [Name] FROM Person.CountryRegion;"
df = pd.read_sql(query, cnxn)
# you do data manipulation that is needed here
# then insert data into oracle
table_name = 'country_region'  # placeholder name for the target table
conn = create_engine('oracle+cx_oracle://xxxxxx')
df.to_sql(table_name, conn, index=False, if_exists="replace")
Something like that (it might not work 100% as written, but it should give you an idea of how to do it).
I'm trying to read a huge PostgreSQL table (~3 million rows of jsonb data, ~30GB size) to do some ETL in Python. I use psycopg2 for working with the database. I want to execute a Python function for each row of the PostgreSQL table and save the results in a .csv file.
The problem is that I need to select the whole 30GB table, and the query runs for a very long time without any possibility to monitor progress. I have found out that there exists a cursor parameter called itersize which determines the number of rows to be buffered on the client.
So I have written the following code:
import psycopg2
conn = psycopg2.connect("host=... port=... dbname=... user=... password=...")
cur = conn.cursor()
cur.itersize = 1000
sql_statement = """
select * from <HUGE TABLE>
"""
cur.execute(sql_statement)
for row in cur:
    print(row)
cur.close()
conn.close()
Since the client buffers 1000 rows at a time, I expect the following behavior:
1. The Python script buffers the first 1000 rows
2. We enter the for loop and print the buffered 1000 rows in the console
3. We reach the point where the next 1000 rows have to be buffered
4. The Python script buffers the next 1000 rows
5. GOTO 2
However, the code just hangs on the cur.execute() statement and no output is printed in the console. Why? Could you please explain what exactly is happening under the hood?
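A likely explanation, offered as a sketch rather than a definitive diagnosis: in psycopg2, itersize only takes effect on named (server-side) cursors. A cursor created with plain conn.cursor() is client-side, so execute() waits until the whole result set has been transferred before the for loop can start. The helper below opens a named cursor instead. The function name stream_rows and the cursor name "etl_stream" are hypothetical, and conn is assumed to be an open psycopg2 connection.

```python
def stream_rows(conn, sql, itersize=1000, cursor_name="etl_stream"):
    """Iterate over a query through a named (server-side) cursor.

    Passing name= to conn.cursor() makes psycopg2 declare a cursor on the
    server; rows are then fetched itersize at a time during iteration.
    """
    cur = conn.cursor(name=cursor_name)
    cur.itersize = itersize
    try:
        cur.execute(sql)
        for row in cur:
            yield row
    finally:
        cur.close()


# Usage (assumes psycopg2 is installed and the connection string is filled in):
#
#   import psycopg2
#   conn = psycopg2.connect("host=... port=... dbname=... user=... password=...")
#   for row in stream_rows(conn, "select * from <HUGE TABLE>"):
#       print(row)
#   conn.close()
```

With a named cursor the first rows should arrive after roughly one round trip, which also makes it possible to report progress while the table is being read.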
I'm trying to insert data from SQL Server into a Snowflake table with Python. It works in general, but when I try to insert a bigger chunk of data, I get an error:
snowflake connector SQL compilation error: maximum number of expressions in a list exceeded, expected at most 16,384
I'm using the Snowflake connector for Python. It works as long as I insert no more than 16,384 rows at once, but my table has over a million records. I don't want to use CSV files.
I was able to insert > 16k recs using sqlalchemy and pandas as:
pandas_df.to_sql(sf_table, con=engine, index=False, if_exists='append', chunksize=16000)
where engine is sqlalchemy.create_engine(...)
This is not the ideal way to load data into Snowflake, but since you specified that you didn't want to create CSV files, you could look into loading the data into a pandas DataFrame and then using the write_pandas function in the Python connector, which will (behind the scenes) leverage a flat file and a COPY INTO statement, which is the fastest way to get data into Snowflake. The issue with this method will likely be that pandas requires a lot of memory on the machine you are running the script on. There is a chunk_size parameter, though, so you can control it with that.
https://docs.snowflake.com/en/user-guide/python-connector-api.html#write_pandas
For anyone facing this problem, below is a complete solution for connecting to Snowflake and inserting data using SQLAlchemy.
from sqlalchemy import create_engine
import pandas as pd
snowflake_username = 'username'
snowflake_password = 'password'
snowflake_account = 'accountname'
snowflake_warehouse = 'warehouse_name'
snowflake_database = 'database_name'
snowflake_schema = 'public'
engine = create_engine(
    'snowflake://{user}:{password}@{account}/{db}/{schema}?warehouse={warehouse}'.format(
        user=snowflake_username,
        password=snowflake_password,
        account=snowflake_account,
        db=snowflake_database,
        schema=snowflake_schema,
        warehouse=snowflake_warehouse,
    ), echo_pool=True, pool_size=10, max_overflow=20
)
connection = engine.connect()
try:
    results = connection.execute('select current_version()').fetchone()
    print(results[0])
    # df is the pandas DataFrame to upload; upper-case its column names
    df.columns = [col.upper() for col in df.columns]
    df.to_sql('table'.lower(), con=connection, schema='amostraschema', index=False, if_exists='append', chunksize=16000)
finally:
    connection.close()
    engine.dispose()
Use executemany with server-side binding. This will create files in a staging area and will allow you to insert more than 16384 rows.
import snowflake.connector
con = snowflake.connector.connect(
    account='',
    user = '',
    password = '',
    dbname='',
    paramstyle = 'qmark')
sql = "insert into tablename (col1, col2) values (?, ?)"
rows = [[1, 2], [3, 4]]
con.cursor().executemany(sql, rows)
See https://docs.snowflake.com/en/user-guide/python-connector-example.html#label-python-connector-binding-batch-inserts for more details
Note: This will not work with client side binding, the %s format.
I have a pipeline that reads gzipped csv data into python and inserts the data into a postgres database, row by row, connected using psycopg2. I've created a thread connection pool, but I'm unsure how to leverage this to insert each row in a separate thread, rather than inserting sequentially. The internet gives me mixed messages if this is even possible, and I have some experience with the threading python module but not a lot.
The pipeline currently is successful, but it is slow, and I'm hoping that it can be made faster by inserting the rows across threads, rather than sequentially.
The following code is simplified for clarity:
main script
for row in reader:
    insertrows(configs, row)
insertrows script
threadpool = pool.ThreadedConnectionPool(5, 20, database=dbname, port=port, user=user, password=password, host=host)
con = threadpool.getconn()
con.autocommit = True
cur = con.cursor()
cur.execute("INSERT INTO table VALUES row")
cur.close()
threadpool.putconn(con)
What I would like to do is rather than looping through the rows, create something like the threading example in this link but without a strong frame of reference for multithreading it's hard for me to figure out how to write something like that for my purposes.
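As a rough sketch of what that could look like, the example below groups rows into batches and hands each batch to a worker thread through concurrent.futures. The names batched, insert_in_threads, and fake_insert_batch are hypothetical, and the database write is replaced by a stand-in so the example runs on its own. With psycopg2, each worker would take its own connection from the pool (getconn / putconn) rather than share one.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def insert_in_threads(rows, insert_batch, batch_size=1000, workers=5):
    """Run insert_batch on each batch in a thread pool; return rows handled."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Consuming the results re-raises any exception from a worker.
        return sum(executor.map(insert_batch, batched(rows, batch_size)))


# Stand-in for the real insert. With psycopg2 this function would roughly do:
#   con = threadpool.getconn()
#   try: run the INSERT for the whole batch, then con.commit()
#   finally: threadpool.putconn(con)
stored = []
lock = threading.Lock()


def fake_insert_batch(batch):
    with lock:
        stored.extend(batch)
    return len(batch)


total = insert_in_threads(range(2500), fake_insert_batch, batch_size=1000, workers=3)
print(total)
```

Note that executor.map submits every batch up front, so for a very large file the batches would need to be fed in more gradually. It is also quite possible that batching alone gives most of the speedup, since row-by-row inserts with autocommit pay one round trip and one commit per row.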
I am new to Python and to data science, so please forgive anything that sounds foolish here. I'll leave out the details unless anyone is curious, but basically I need to connect to Microsoft SQL Server and upload a relatively large pandas DataFrame (~500k rows), and I need to do this almost every day as the project currently stands.
It doesn't have to be a Pandas DF - I've read about using odo for csv files but I haven't been able to get anything to work. The issue I'm having is that I can't bulk insert the DF because the file isn't on the same machine as the SQL Server instance. I'm consistently getting errors like the following:
pyodbc.ProgrammingError: ('42000', "[42000] [Microsoft][ODBC SQL
Server Driver][SQL Server]Incorrect syntax near the keyword 'IF'.
(156) (SQLExecDirectW)")
As I've attempted different SQL statements you can replace IF with whatever has been the first COL_NAME in the CREATE statement. I'm using SQLAlchemy to create the engine and connect to the database. This may go without saying but the pd.to_sql() method is just way too slow for how much data I'm moving so that's why I need something faster.
I'm using Python 3.6 by the way. I've put down here most of the things that I've tried that haven't been successful.
import pandas as pd
from sqlalchemy import create_engine
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), columns=['test_col'])
address = 'mssql+pyodbc://uid:pw@server/path/database?driver=SQL Server'
engine = create_engine(address)
connection = engine.raw_connection()
cursor = connection.cursor()
# Attempt 1 <- This failed to even create a table at the cursor_execute statement so my issues could be way in the beginning here but I know that I have a connection to the SQL Server because I can use pd.to_sql() to create tables successfully (just incredibly slowly for my tables of interest)
create_statement = """
DROP TABLE test_table
CREATE TABLE test_table (test_col)
"""
cursor.execute(create_statement)
test_insert = '''
INSERT INTO test_table
(test_col)
values ('abs');
'''
cursor.execute(test_insert)
# Attempt 2 <- From iabdb WordPress blog I came across
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
records = [str(tuple(x)) for x in take_rates.values]
insert_ = """
INSERT INTO test_table
("A")
VALUES
"""
for batch in chunker(records, 2): # This would be set to 1000 in practice I hope
    print(batch)
    rows = str(batch).strip('[]')
    print(rows)
    insert_rows = insert_ + rows
    print(insert_rows)
    cursor.execute(insert_rows)
    #conn.commit() # don't know when I would need to commit
conn.close()
# Attempt 3 <- From a related Stack Exchange post
# create the table but first drop if it already exists
command = """DROP TABLE IF EXISTS test_table
CREATE TABLE test_table # these columns are from my real dataset
"Serial Number" serial primary key,
"Dealer Code" text,
"FSHIP_DT" timestamp without time zone,
;"""
cursor.execute(command)
connection.commit()
# stream the data using 'to_csv' and StringIO(); then use sql's 'copy_from' function
output = io.StringIO()
# ignore the index
take_rates.to_csv(output, sep='~', header=False, index=False)
# jump to start of stream
output.seek(0)
contents = output.getvalue()
cur = connection.cursor()
# null values become ''
cur.copy_from(output, 'Config_Take_Rates_TEST', null="")
connection.commit()
cur.close()
It seems to me that MS SQL Server is just not a nice Database to play around with...
I want to apologize for the rough formatting - I've been at this script for weeks now but just finally decided to try to organize something for StackOverflow. Thank you very much for any help anyone can offer!
If you only need to replace the existing table, truncate it and use the bcp utility to upload the data. It's much faster.
from subprocess import call
command = "TRUNCATE TABLE test_table"  # execute this on your SQL Server connection first
# bcp loads every line as data, so the header row is omitted here
take_rates.to_csv('take_rates.csv', sep='\t', header=False, index=False)
# 'bcp_errors.log' is a placeholder error-file name; server, user, password and database are your own settings
call(['bcp', 'test_table', 'in', 'take_rates.csv', '-S', server, '-U', user, '-P', password,
      '-d', database, '-c', '-t', '\t', '-r', '\n', '-e', 'bcp_errors.log'])
You will need to install the bcp utility (yum install mssql-tools on CentOS/RedHat).
'DROP TABLE IF EXISTS test_table' is only valid T-SQL from SQL Server 2016 onward; on older versions it is a syntax error.
On those versions you can do something like this:
if (object_id('test_table') is not null)
    DROP TABLE test_table
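Separately from bcp, Attempt 2 in the question builds the VALUES list by pasting str(tuple) output into the SQL text, which is fragile with quotes, NULLs, and dates. A hedged alternative is to generate a multi-row INSERT with ? placeholders and pass the values as parameters. The helper build_insert and the column names below are hypothetical, and the actual cursor.execute call is left as a comment so the sketch runs without a database.

```python
def build_insert(table, columns, n_rows):
    """Return a multi-row INSERT with one '?' placeholder per value."""
    row = "(" + ", ".join("?" for _ in columns) + ")"
    return "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join(row for _ in range(n_rows)))


def chunker(seq, size):
    """Split a sequence into consecutive slices of at most size items."""
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))


records = [(1, "a"), (2, "b"), (3, "c")]
statements = []
for batch in chunker(records, 2):
    sql = build_insert("test_table", ["col_a", "col_b"], len(batch))
    # Flatten the batch so the values line up with the placeholders.
    params = [value for rec in batch for value in rec]
    statements.append((sql, params))
    # With a pyodbc cursor this would be: cursor.execute(sql, params)

for sql, params in statements:
    print(sql, params)
```

As far as I know, SQL Server accepts at most 1000 rows in one VALUES list and at most 2100 parameters per statement, so the batch size has to respect both limits. pyodbc also has a cursor.fast_executemany flag that may be worth trying with a plain single-row INSERT and executemany.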