What is the recommended module and syntax to programmatically copy from a CSV file in S3 to a Redshift table? I've been trying with the psycopg2 module, but without success (see psycopg2 copy_expert() - how to copy in a gzipped csv file?). I've tried cur.execute(), cur.copy_expert() and cur.copy_from(), all unsuccessfully. My experience and the comments I've read lead me to conclude that psycopg2, while sufficient for programming a Postgres DB from Python, will not work for Redshift tables for some reason. So what is the workaround if I want a Python script to do this copy?
Here is the COPY statement I want to run. The source is a gzipped csv file with a pipe delimiter. This works fine from a SQL interface like DBeaver, but I can't figure out how it would translate to Python:
'''COPY <destination_table> from 's3://bucket/my_source_file.csv.gz' CREDENTIALS <my credentials> delimiter '|' IGNOREHEADER 1 ENCODING UTF8 IGNOREBLANKLINES NULL AS 'NULL' EMPTYASNULL BLANKSASNULL gzip ACCEPTINVCHARS timeformat 'auto' dateformat 'auto' MAXERROR 100 compupdate on;'''
I connect via ODBC using the pyodbc library successfully. Just call .execute(copy_command) and you shouldn't have an issue.
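For example, a minimal sketch of that approach; the DSN name is a placeholder for whatever you configured in your ODBC driver, and the COPY text is the one from the question:

import pyodbc

conn = pyodbc.connect("DSN=my_redshift_dsn", autocommit=True)  # placeholder DSN
cur = conn.cursor()
cur.execute("""
    COPY destination_table
    FROM 's3://bucket/my_source_file.csv.gz'
    CREDENTIALS '<my credentials>'
    DELIMITER '|' IGNOREHEADER 1 GZIP ENCODING UTF8
    NULL AS 'NULL' EMPTYASNULL BLANKSASNULL
    ACCEPTINVCHARS TIMEFORMAT 'auto' DATEFORMAT 'auto' MAXERROR 100;
""")
cur.close()
conn.close()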
There are plenty of examples online of connecting to Amazon Redshift from Python. For example:
Connect to Redshift via Python's psycopg2
Access your data in Amazon Redshift and PostgreSQL with Python and R
Copying data from S3 to AWS redshift using python and psycopg2
They typically look like:
conn = psycopg2.connect(...)
cur = conn.cursor()
cur.execute("COPY...")
conn.commit()
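Filled in a little more, a hedged version of that pattern with the COPY from the question (host and credentials are placeholders); the key detail is the conn.commit(), without which psycopg2 rolls the load back:

import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="mydb",
    user="myuser",
    password="mypassword",
)
cur = conn.cursor()
cur.execute("""
    COPY destination_table
    FROM 's3://bucket/my_source_file.csv.gz'
    CREDENTIALS '<my credentials>'
    DELIMITER '|' IGNOREHEADER 1 GZIP
    NULL AS 'NULL' EMPTYASNULL BLANKSASNULL
    ACCEPTINVCHARS TIMEFORMAT 'auto' DATEFORMAT 'auto' MAXERROR 100;
""")
conn.commit()   # COPY runs inside a transaction; forgetting this is a common pitfall
cur.close()
conn.close()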
I've installed sqlalchemy-access so that I'm able to use pandas and pyodbc to query my Access DBs.
The issue is that it's incredibly slow because it does single-row inserts. Another post suggested I use method='multi', and while it seems to work for whoever asked that question, it throws a CompileError for me.
CompileError: The 'access' dialect with current database version settings does not support in-place multirow inserts.
AttributeError: 'CompileError' object has no attribute 'orig'
import pandas as pd
import pyodbc
import urllib.parse
from sqlalchemy import create_engine

# accessDB and ssData are paths defined elsewhere in my script
connection_string = (
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    rf"DBQ={accessDB};"
    r"ExtendedAnsiSQL=1;"
)
connection_uri = f"access+pyodbc:///?odbc_connect={urllib.parse.quote_plus(connection_string)}"
engine = create_engine(connection_uri)
conn = engine.connect()

# Read in Tableau SuperStore data
dfSS = pd.read_excel(ssData)
dfSS.to_sql('SuperStore', conn, index=False, method='multi')
Access SQL doesn't support multi-row inserts, so to_sql will never be able to support them either. That other post was probably using SQLite.
Instead, you can write the data frame to CSV and insert the CSV by using a query.
Or, of course, don't read the Excel file in Python at all, but just insert the Excel file by query. This will always be much faster, as Access can read the data directly instead of Python reading it and then transmitting it.
E.g.
INSERT INTO SuperStore
SELECT * FROM [Sheet1$] IN "C:\Path\To\File.xlsx"'Excel 12.0 Macro;HDR=Yes'
You should be able to execute this using pyodbc without needing to involve SQLAlchemy. Do note the combination of double and single quotes; it can be a bit painful when embedding them in other programming languages.
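For instance, a rough pyodbc sketch of that (the driver string is the one from the question; the database and workbook paths are placeholders):

import pyodbc

connection_string = (
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\Path\To\database.accdb;"          # placeholder .accdb path
    r"ExtendedAnsiSQL=1;"
)
conn = pyodbc.connect(connection_string, autocommit=True)
cur = conn.cursor()

# raw triple-quoted string so the quotes and backslashes survive intact
sql = r"""INSERT INTO SuperStore
SELECT * FROM [Sheet1$] IN "C:\Path\To\File.xlsx"'Excel 12.0 Macro;HDR=Yes'"""
cur.execute(sql)

cur.close()
conn.close()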
I have local tab delimited raw data files "...\publisher.txt" and "...\field.txt" that I would like to load into a local SQLite database. The corresponding tables are already defined in the local database. I am accessing the database through the python-sql library in an ipython notebook. Is there a simple way to load these text files into the database?
The CLI shell's 'readfile' function doesn't seem to work in a Python context:
INSERT INTO Pub(k,p) VALUES('pubFile.txt',readfile('pubFile.txt'));
Throws error:
(sqlite3.OperationalError) no such function: readfile
[SQL: INSERT INTO Pub(k,p) VALUES('pubFile.txt',readfile('pubFile.txt'));]
(Background on this error at: http://sqlalche.me/e/e3q8)
No, there is no such command in SQLite itself (any longer). That feature was removed and has been replaced by the SQLite CLI's .import command.
See the official documentation:
The COPY command is available in SQLite version 2.8 and earlier. The COPY command has been removed from SQLite version 3.0 due to complications in trying to support it in a mixed UTF-8/16 environment. In version 3.0, the command-line shell contains a new command .import that can be used as a substitute for COPY.
The COPY command is an extension used to load large amounts of data into a table. It is modeled after a similar command found in PostgreSQL. In fact, the SQLite COPY command is specifically designed to be able to read the output of the PostgreSQL dump utility pg_dump so that data can be easily transferred from PostgreSQL into SQLite.
Sample code to load a text file into an SQLite table via the CLI is shown below:
sqlite3 test.db ".import test.txt test_table_name"
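If you would rather stay inside Python than shell out to the CLI, a minimal sketch is below: it parses the tab-delimited file with the csv module and bulk-inserts the rows with executemany (the database path, table name and column count are illustrative):

import csv
import sqlite3

conn = sqlite3.connect("my_database.db")        # path to the existing database
cur = conn.cursor()

with open("publisher.txt", "r", newline="", encoding="utf-8") as f:
    rows = [tuple(row) for row in csv.reader(f, delimiter="\t")]

# the number of ? placeholders must match the number of columns in the table
cur.executemany("INSERT INTO publisher VALUES (?, ?)", rows)
conn.commit()
conn.close()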
You may read the input file into a string and then insert it:
sql = "INSERT INTO Pub (k, p) VALUES ('pubFile.txt', ?)"
with open ("pubFile.txt", "r") as myfile:
data = '\n'.join(myfile.readlines())
cur = conn.cursor()
cur.execute(sql, (data,))
conn.commit()
I am trying to open a .sqlite3 file in Python, but no information is returned. So I tried R and still get nothing for the tables. I would like to know what tables are in this file.
I used the following code for python:
import sqlite3
from sqlite3 import Error


def create_connection(db_file):
    """ create a database connection to the SQLite database
        specified by the db_file
    :param db_file: database file
    :return: Connection object or None
    """
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except Error as e:
        print(e)

    return None


database = "D:\\...\assignee.sqlite3"
conn = create_connection(database)
cur = conn.cursor()
rows = cur.fetchall()
but rows are empty!
This is where I got the assignee.sqlite3 from:
https://github.com/funginstitute/downloads
I also tried RStudio; below is the code:
> con <- dbConnect(drv=RSQLite::SQLite(), dbname="D:/.../assignee")
> tables <- dbListTables(con)
But tables comes back empty as well.
First, make sure you provided the correct path in your connection string to the SQLite DB, e.g.
conn = sqlite3.connect(r"C:\users\guest\desktop\example.db")
(note the raw string, so the backslashes are not treated as escape sequences).
Also make sure you are using the SQLite library in both the unit tests and the production code.
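As a quick sanity check, here is a minimal sketch (the path is a placeholder) that connects and asks sqlite_master for the table names, instead of calling fetchall() on a cursor that never executed a query:

import sqlite3

conn = sqlite3.connect(r"D:\path\to\assignee.sqlite3")   # placeholder path
cur = conn.cursor()
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table';")
print(cur.fetchall())   # an empty list means the file really contains no tables
conn.close()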
Also check the types of SQLite connection strings and determine which one your DB belongs to:

Basic:
Data Source=c:\mydb.db;Version=3;
(Version 2 is not supported by this class library.)

In-memory database (an SQLite database is normally stored on disk, but it can also be stored in memory):
Data Source=:memory:;Version=3;New=True;

Using UTF16:
Data Source=c:\mydb.db;Version=3;UseUTF16Encoding=True;

With password:
Data Source=c:\mydb.db;Version=3;Password=myPassword;

So make sure you wrote the proper connection string for your SQLite DB.
If you still cannot see it, check whether the disk containing /tmp is full. Otherwise, it might be an encrypted database, or it may be locked and in use by some other application. You can confirm that with one of the many SQLite browser tools: download one (Windows, Mac and Linux versions are available), navigate directly to where your DB exists, and it will give you an indication of the problem.
Good luck!
LOAD is a DB2 utility that I would like to use to insert data into a table from a CSV file. How can I do this in Python using the ibm_db driver? I don't see anything in the docs here
CMD: LOAD FROM xyz OF del INSERT INTO FOOBAR
Running this as standard SQL fails as expected:
Transaction couldn't be completed: [IBM][CLI Driver][DB2/LINUXX8664] SQL0104N An unexpected token "LOAD FROM xyz OF del" was found following "BEGIN-OF-STATEMENT". Expected tokens may include: "<space>". SQLSTATE=42601 SQLCODE=-104
Using the db2 CLP directly (i.e. os.system('db2 -f /path/to/script.file')) is not an option as DB2 sits on a different machine that I don't have SSH access to.
EDIT:
Using the ADMIN_CMD utility also doesn't work, because the file being loaded cannot be put on the database server due to a firewall. For now, I've switched to using INSERT statements.
LOAD is an IBM command line processor command, not an SQL command. As such, it isn't available through the ibm_db module.
The most typical way to do this would be to load the CSV data into Python (either all the rows, or in batches if it is too large for memory) and then use a bulk insert to insert many rows at once into the database.
To perform a bulk insert you can use the executemany method; a sketch follows.
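A minimal sketch of that approach, using the ibm_db_dbi DB-API wrapper that ships with ibm_db; the connection string, table name, column count and CSV path are placeholders:

import csv
import ibm_db_dbi

conn = ibm_db_dbi.connect(
    "DATABASE=mydb;HOSTNAME=host;PORT=50000;PROTOCOL=TCPIP;UID=user;PWD=pass;"
)
cur = conn.cursor()

insert_sql = "INSERT INTO FOOBAR (COL1, COL2, COL3) VALUES (?, ?, ?)"
batch = []
with open("xyz.csv", "r", newline="") as f:
    for row in csv.reader(f):
        batch.append(tuple(row))
        if len(batch) >= 1000:          # insert in chunks to limit memory use
            cur.executemany(insert_sql, batch)
            batch = []
    if batch:
        cur.executemany(insert_sql, batch)

conn.commit()
conn.close()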
You could CALL the ADMIN_CMD procedure. ADMIN_CMD has support for both LOAD and IMPORT. Note that both commands require the loaded/imported file to be on the database server.
The example is taken from the DB2 Knowledge Center:
CALL SYSPROC.ADMIN_CMD('load from staff.del of del replace
keepdictionary into SAMPLE.STAFF statistics use profile
data buffer 8')
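From Python, calling ADMIN_CMD is just executing that CALL statement; a hedged sketch with ibm_db, reusing the LOAD example above (the connection details are placeholders, and staff.del must be visible to the database server):

import ibm_db

# placeholder connection string -- substitute your own server details
conn = ibm_db.connect(
    "DATABASE=SAMPLE;HOSTNAME=host;PORT=50000;PROTOCOL=TCPIP;UID=user;PWD=pass;",
    "", "",
)

load_cmd = ("load from staff.del of del replace keepdictionary "
            "into SAMPLE.STAFF statistics use profile data buffer 8")
ibm_db.exec_immediate(conn, "CALL SYSPROC.ADMIN_CMD('{}')".format(load_cmd))
ibm_db.close(conn)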
CSV to DB2 with Python
Briefly: One solution is to use an SQLAlchemy adapter and Db2’s External Tables.
SQLAlchemy:
The Engine is the starting point for any SQLAlchemy application. It’s “home base” for the actual database and its DBAPI, delivered to the SQLAlchemy application through a connection pool and a Dialect, which describes how to talk to a specific kind of database/DBAPI combination.
Where above, an Engine references both a Dialect and a Pool, which together interpret the DBAPI’s module functions as well as the behavior of the database.
Creating an engine is just a matter of issuing a single call, create_engine():
dialect+driver://username:password@host:port/database
Where dialect is a database name such as mysql, oracle, postgresql, etc., and driver the name of a DBAPI, such as psycopg2, pyodbc, cx_oracle, etc.
Load data by using transient external table:
Transient external tables (TETs) provide a way to define an external table that exists only for the duration of a single query.
TETs have the same capabilities and limitations as normal external tables. A special feature of a TET is that you do not need to define the table schema when you use the TET to load data into a table or when you create the TET as the target of a SELECT statement.
Following is the syntax for a TET:
INSERT INTO <table> SELECT <column_list | *>
FROM EXTERNAL 'filename' [(table_schema_definition)]
[USING (external_table_options)];
CREATE EXTERNAL TABLE 'filename' [USING (external_table_options)]
AS select_statement;
SELECT <column_list | *> FROM EXTERNAL 'filename' (table_schema_definition)
[USING (external_table_options)];
For information about the values that you can specify for the external_table_options variable, see External table options.
General example
Insert data from a transient external table into the database table on the Db2 server by issuing the following command:
INSERT INTO EMPLOYEE SELECT * FROM external '/tmp/employee.dat' USING (delimiter ',' MAXERRORS 10 SOCKETBUFSIZE 30000 REMOTESOURCE 'JDBC' LOGDIR '/logs' )
Requirements
pip install ibm-db
pip install ibm-db-sa
pip install SQLAlchemy
Python code
The example below shows how it all works together.
from sqlalchemy import create_engine

usr = "enter_username"
pwd = "enter_password"
hst = "enter_host"
prt = "enter_port"
db = "enter_db_name"

# SQLAlchemy URL
conn_params = "db2+ibm_db://{0}:{1}@{2}:{3}/{4}".format(usr, pwd, hst, prt, db)

schema = "enter_name_restore_schema"
table = "enter_name_restore_table"
destination = "/path/to/csv/file_name.csv"

try:
    print("Connecting to DB...")
    engine = create_engine(conn_params)
    engine.connect()  # optional, output: DB2/linux...
    print("Successfully Connected!")
except Exception as e:
    print("Unable to connect to the server.")
    print(str(e))

external = """INSERT INTO {0}.{1} SELECT * FROM EXTERNAL '{2}' USING (CCSID 1208 DELIMITER ',' REMOTESOURCE LZ4 NOLOG TRUE )""".format(
    schema, table, destination
)

try:
    print("Restoring data to the server...")
    engine.execute(external)
    print("Data restored successfully.")
except Exception as e:
    print("Unable to restore.")
    print(str(e))
Conclusion
A great solution for restoring large files; specifically, a 600 MB file worked without any problems.
It is also useful for copying data from one table/database to another: the backup is done as an export to CSV, and then that CSV is loaded into Db2 with the given example.
The SQLAlchemy engine can be combined with other databases such as sqlite, mysql, postgresql, oracle, mssql, etc.
I want help with the following task:
I want to download a 20 GB dataset/data dump from an Oracle server (Oracle 11g database) to my local disk drive (i.e. E:/python/). I want to achieve this using Python 3.4 (Windows 64-bit; I'm using Anaconda - Spyder IDE).
I normally use SAS for the task, using the following query:
LIBNAME ORACLE ODBC DSN= oracle UID= user PWD= password; #CONNECTION TO SERVER
LIBNAME LOCAL "E:/PYTHON"; #SETTING LOCAL PATH FOR DATA STORE
CREATE TABLE LOCAL.MYnewTable AS
SELECT * FROM ORACLE.DOWLOAD_TABLE
;QUIT;
The above query will download the 20 GB data dump from the server to my local E:/ drive using SAS. How do I do the same in Python? My RAM is only 4 GB, so downloading the entire 20 GB dataset into a pandas data frame will eat up the RAM (I believe; I may be wrong). SAS does this task very easily. Please suggest the equivalent for Python.
Thanks!!
Okay! So I have found a solution to my own question. This can be done using cx_Oracle as well, but I am using Python 3.5 and apparently cx_Oracle for Python 3.5 is not available (to my knowledge), which is why I have used the pyodbc package.
import csv
import pyodbc

conn = pyodbc.connect('''DRIVER=<<name of server connection in ODBC driver>>;
                         SERVER=<<server IP>> i.e.: <<00.00.00.00>>;
                         PORT=<<5000>>;
                         DATABASE=<<Server database name>>;
                         UID=<<xyz>>;
                         PWD=<<****>>;''')

# needs to be at the top of your module
def ResultIter(cursor, arraysize=1000):
    'An iterator that uses fetchmany to keep memory usage down'
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result

# where conn is a DB-API 2.0 database connection object
cursor = conn.cursor()
cursor.execute('select * from <<table_name>>')

csvFile = open('stored_data_fetched_file.csv', 'a', newline='')
csvWriter = csv.writer(csvFile)

for result in ResultIter(cursor):
    csvWriter.writerow(result)

csvFile.close()
This can be used for a Netezza connection as well; I have tried and tested it.
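For completeness, a different technique that also keeps memory usage down is the chunksize parameter of pandas' read_sql, which streams the table in slices instead of one giant data frame. A rough sketch (the DSN and output path are placeholders; the table name comes from the SAS example above):

import pandas as pd
import pyodbc

conn = pyodbc.connect("DSN=oracle;UID=user;PWD=password")   # placeholder DSN

first_chunk = True
for chunk in pd.read_sql("SELECT * FROM DOWLOAD_TABLE", conn, chunksize=50000):
    # append each chunk to the CSV, writing the header only once
    chunk.to_csv("E:/python/mynewtable.csv", mode="a",
                 header=first_chunk, index=False)
    first_chunk = False

conn.close()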