Efficiently importing CSVs into Oracle tables (Python)

I am using Python 3.6 to iterate through a folder structure and collect the file paths of all the CSVs I want to import into two already-created Oracle tables.
con = cx_Oracle.connect('BLAH/BLAH#XXX:666/BLAH')

# Targets the exact file paths of the CSVs we want to import into the Oracle database
if os.access(base_cust_path, os.W_OK):
    for path, dirs, files in os.walk(base_cust_path):
        if "Daily" not in path and "Daily" not in dirs and "Jul" not in path and "2017-07" not in path:
            for f in files:
                if "OUTPUT" in f and "MERGE" not in f and "DD" not in f:
                    print("Import to OUTPUT table: " + path + "/" + f)
                    # Run function to import to SQL Table 1
                if "MERGE" in f and "OUTPUT" not in f and "DD" not in f:
                    print("Import to MERGE table: " + path + "/" + f)
                    # Run function to import to SQL Table 2
A while ago I was able to use PHP to produce a function that used the BULK INSERT SQL command for SQL Server:
function bulkInserttoDB($csvPath) {
    $tablename = "[DATABASE].[dbo].[TABLE]";
    $insert = "BULK INSERT " . $tablename . "
               FROM '" . $csvPath . "'
               WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')";
    print_r($insert);
    print_r("<br>");
    $result = odbc_prepare($GLOBALS['connection'], $insert);
    odbc_execute($result) or die(odbc_error($connection));
}
I was looking to replicate this for Python, but a few Google searches led me to believe there is no 'BULK INSERT' command for Oracle. This BULK INSERT command had awesome performance.
Since these CSVs I am loading are huge (2GB x 365), performance is crucial. What is the most efficient way of doing this?

The bulk insert is done with the cx_Oracle library and the following commands:
con = cx_Oracle.connect(CONNECTION_STRING)
cur = con.cursor()
cur.prepare("""INSERT INTO MyTable VALUES (
    to_date(:1, 'YYYY/MM/DD HH24:MI:SS'),
    :2,
    :3,
    to_date(:4, 'YYYY/MM/DD HH24:MI:SS'),
    :5,
    :6,
    to_date(:7, 'YYYY/MM/DD HH24:MI:SS'),
    :8,
    to_date(:9, 'YYYY/MM/DD HH24:MI:SS'))""")  # prepare your statement
rows.append((sline[0], sline[1], sline[2], sline[3], sline[4],
             sline[5], sline[6], sline[7], sline[8]))  # collect your data row by row
cur.executemany(None, rows)  # insert
You prepare an insert statement once, append each parsed line of your file to a list, and finally call executemany. It sends the whole batch to the database in a single round trip, which is far faster than executing the inserts one by one.
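Since the files here are huge, it also helps to feed executemany fixed-size batches instead of one giant list, so memory stays bounded. A minimal sketch of that chunking pattern, using the stdlib sqlite3 module as a stand-in database so it runs anywhere (table name, columns, and chunk size are made up for illustration; with cx_Oracle you would use :1-style placeholders and cur.executemany(None, batch) after cur.prepare):

```python
import csv
import io
import sqlite3

def insert_in_chunks(cur, sql, rows, chunk_size=50000):
    """Feed executemany() fixed-size batches so memory use stays bounded."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= chunk_size:
            cur.executemany(sql, batch)
            batch.clear()
    if batch:  # flush the final partial batch
        cur.executemany(sql, batch)

# Stand-in for one of the big CSV files
csv_data = io.StringIO("2017/01/01 00:00:00,a,1\n2017/01/02 00:00:00,b,2\n")

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE my_table (stamp TEXT, label TEXT, value INTEGER)")
insert_in_chunks(cur, "INSERT INTO my_table VALUES (?, ?, ?)",
                 csv.reader(csv_data), chunk_size=1000)
con.commit()
```

With real 2 GB files you would pass csv.reader(open(path, newline='')) so rows are streamed from disk rather than loaded up front.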

Related

How do I store file names in a database in python?

I am trying to get a list of files in a user specified directory to be saved to a database. What I have at the moment is :
import os
import sqlite3

def get_list():
    folder = input("Directory to scan : ")
    results = []
    for path in os.listdir(folder):
        if os.path.isfile(os.path.join(folder, path)):
            results.append(path)
    print(results)
    return results

def populate(results):
    connection = sqlite3.connect("videos.db")
    with connection:
        connection.execute("CREATE TABLE IF NOT EXISTS files (id INTEGER PRIMARY KEY, file_name TEXT);")
        for filename in results:
            insert_string = "INSERT INTO files (file_name) VALUES ('" + filename + "');"
            connection.execute(insert_string)

filelist = get_list()
populate(filelist)
It runs without a problem and prints out a list of the file names, which is great, but when it runs the INSERT SQL statement, that seems to have no effect on the database table. I have tried to debug it: the statement saved in the variable looks good, and when I execute it manually in the console it inserts a row in the table, but when running the script, nothing changes. Am I missing something really simple here?
Python's SQLite3 module doesn't auto-commit by default, so you need to call connection.commit() after you've finished executing queries. This is covered in the tutorial.
In addition, use ? placeholders to avoid SQL injection issues:
cur.execute('INSERT INTO files (file_name) VALUES (?)', (filename,))
Once you do that, you can insert all of your filenames at once using executemany:
cur.executemany(
    'INSERT INTO files (file_name) VALUES (?)',
    [(filename,) for filename in results],
)
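Putting the commit, the ? placeholders, and executemany together, a self-contained sketch of the fixed script (with an in-memory database and a hard-coded file list standing in for the scanned directory, so it runs as-is):

```python
import sqlite3

results = ["a.mp4", "b.mkv", "c.avi"]  # stand-in for get_list()

connection = sqlite3.connect(":memory:")
cur = connection.cursor()
cur.execute(
    "CREATE TABLE IF NOT EXISTS files (id INTEGER PRIMARY KEY, file_name TEXT)")
cur.executemany(
    "INSERT INTO files (file_name) VALUES (?)",
    [(filename,) for filename in results])
connection.commit()  # without this, the inserts are lost when the connection closes

stored = [row[0] for row in cur.execute("SELECT file_name FROM files ORDER BY id")]
```

After the commit, `stored` contains the three file names in insertion order.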

Retrieve zipped file from bytea column in PostgreSQL using Python

I have a table in my PostgreSQL database in which a column type is set to bytea in order to store zipped files.
The storing procedure works fine. I have problems when I need to retrieve the zipped file I uploaded.
def getAnsibleByLibrary(projectId):
    con = psycopg2.connect(
        database="xyz",
        user="user",
        password="pwd",
        host="localhost",
        port="5432",
    )
    print("Database opened successfully")
    cur = con.cursor()
    query = "SELECT ansiblezip FROM library WHERE library.id = (SELECT libraryid from project WHERE project.id = '"
    query += str(projectId)
    query += "')"
    cur.execute(query)
    rows = cur.fetchall()
    repository = rows[0][0]
    con.commit()
    con.close()
    print(repository, type(repository))
    with open("zippedOne.zip", "wb") as fin:
        fin.write(repository)
This code creates a zippedOne.zip file but it seems to be an invalid archive.
I also tried saving repository.tobytes(), but it gives the same result.
I don't understand how to handle memoryview objects.
If I try:
print(repository, type(repository))
the result is:
<memory at 0x7f6b62879348> <class 'memoryview'>
If I try to unzip the file:
chain@wraware:~$ unzip zippedOne.zip
The result is:
Archive: zippedOne.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of zippedOne.zip or
zippedOne.zip.zip, and cannot find zippedOne.zip.ZIP, period.
Trying to extract it in windows gives me the error: "The compressed (zipped) folder is invalid"
This code, based on the example in the question, works for me:
import io
import zipfile
import psycopg2
DROP = """DROP TABLE IF EXISTS so69434887"""
CREATE = """\
CREATE TABLE so69434887 (
id serial primary key,
ansiblezip bytea
)
"""
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='w') as zf:
    zf.writestr('so69434887.txt', 'abc')
with psycopg2.connect(database="test") as conn:
    cur = conn.cursor()
    cur.execute(DROP)
    cur.execute(CREATE)
    conn.commit()
    cur.execute("""INSERT INTO so69434887 (ansiblezip) VALUES (%s)""", (buf.getvalue(),))
    conn.commit()
    cur.execute("""SELECT ansiblezip FROM so69434887""")
    memview, = cur.fetchone()
with open('so69434887.zip', 'wb') as f:
    f.write(memview)
and the resulting file can be unzipped (on Linux, at least):
$ unzip -p so69434887.zip so69434887.txt
abc
So perhaps the data is not being inserted correctly.
FWIW I got the "End-of-central-directory signature not found" until I made sure I closed the zipfile object before writing to the database.
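The close-before-insert detail matters because ZipFile only writes the end-of-central-directory record when the archive is closed; until then the buffer holds the member data but is not a valid zip file. A small demonstration, pure stdlib, no database involved:

```python
import io
import zipfile

buf = io.BytesIO()
zf = zipfile.ZipFile(buf, mode='w')
zf.writestr('hello.txt', 'abc')

unfinished = buf.getvalue()  # member written, but no central directory yet
zf.close()                   # close() appends the central directory + EOCD record
finished = buf.getvalue()

# Only the closed archive is recognized as a valid zip file
valid = zipfile.is_zipfile(io.BytesIO(finished))
invalid = zipfile.is_zipfile(io.BytesIO(unfinished))
```

Inserting `unfinished` into the bytea column would reproduce exactly the "End-of-central-directory signature not found" error from the question.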

How to save files from postgreSQL to local?

I have a requirement that there are a lot of files (such as image, .csv) saved in a table hosted in Azure PostgreSQL. Files are saved as binary data type. Is it possible extract them directly to local file system by SQL query? I am using python as my programming language, any guide or code sample is appreciated, thanks!
If you just want to extract binary files from SQL to local and save as a file, try the code below:
import psycopg2
import os

connstr = "<conn string>"
rootPath = "d:/"

def saveBinaryToFile(sqlRowData):
    destPath = rootPath + str(sqlRowData[1])
    if os.path.isdir(destPath):
        destPath += '_2'
    os.mkdir(destPath)
    # 'with' writes and closes the file handle (the original called .close without parentheses)
    with open(destPath + '/' + sqlRowData[0] + ".jpg", "wb") as newfile:
        newfile.write(sqlRowData[2])

conn = psycopg2.connect(connstr)
cur = conn.cursor()
sql = 'select * from images'
cur.execute(sql)
rows = cur.fetchall()
print(sql)
print('result:' + str(rows))
for row in rows:
    saveBinaryToFile(row)
conn.close()
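The same pattern can be sketched with sqlite3 and a temporary directory, so it runs without a PostgreSQL server (the (name, folder, data) column layout is an assumption for illustration, matching how saveBinaryToFile indexes its row):

```python
import os
import sqlite3
import tempfile

root_path = tempfile.mkdtemp()  # stand-in for "d:/"

def save_binary_to_file(row, root):
    """row = (file name, folder name, binary payload)."""
    dest = os.path.join(root, str(row[1]))
    if os.path.isdir(dest):
        dest += "_2"  # avoid clobbering an existing folder
    os.mkdir(dest)
    target = os.path.join(dest, row[0] + ".jpg")
    with open(target, "wb") as f:
        f.write(row[2])
    return target

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE images (name TEXT, folder TEXT, data BLOB)")
cur.execute("INSERT INTO images VALUES (?, ?, ?)", ("pic", "album1", b"\xff\xd8fake"))
cur.execute("SELECT * FROM images")
written = [save_binary_to_file(row, root_path) for row in cur.fetchall()]
con.close()
```

Each BLOB lands on disk byte-for-byte as it was stored, which is all "extracting to local" amounts to.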
(Screenshots of the sample SQL table and the resulting files were attached in the original answer.)

Python script hangs when executing long running query, even after query completes

I've got a Python script that loops through folders and within each folder, executes the sql file against our Redshift cluster (using psycopg2). Here is the code that does the loop (note: this works just fine for queries that take only a few minutes to execute):
for folder in dir_list:
    # Each query is stored in a folder by group, so we have to go through each folder and then each file in that folder
    file_list = os.listdir(source_dir_wkly + "\\" + str(folder))
    for f in file_list:
        src_filename = source_dir_wkly + "\\" + str(folder) + "\\" + str(f)
        dest_filename = dest_dir_wkly + "\\" + os.path.splitext(os.path.basename(src_filename))[0] + ".csv"
        result = dal.execute_query(src_filename)
        result.to_csv(path_or_buf=dest_filename, index=False)
execute_query is a method stored in another file:
def execute_query(self, source_path):
    conn_rs = psycopg2.connect(self.conn_string)
    cursor = conn_rs.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    sql_file = self.read_sql_file(source_path)
    cursor.execute(sql_file)
    records = cursor.fetchall()
    conn_rs.commit()
    return pd.DataFrame(data=records)

def read_sql_file(self, path):
    sql_path = path
    f = open(sql_path, 'r')
    return f.read()
I have a couple queries that take around 15 minutes to execute (not unusual given the size of the data in our Redshift cluster), and they execute just fine in SQL Workbench. I can see in the AWS Console that the query has completed, but the script just hangs and doesn't dump the results to a csv file, nor does it proceed to the next file in the folder.
I don't have any timeouts specified. Is there anything else I'm missing?
The line records = cursor.fetchall() is likely the culprit. It reads all data and hence loads all results from the query into memory. Given that your queries are very large, that data probably cannot all be loaded into memory at once.
You should iterate over the results from the cursor and write into your csv one by one. In general trying to read all data from a database query at once is not a good idea.
You will need to refactor your code to do so:
writer = csv.writer(csv_fh)
for record in cursor:
    writer.writerow(record)
Where csv_fh is a file handle to your csv file. Your use of pd.DataFrame will need rewriting as it looks like it expects all data to be passed to it.
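A runnable sketch of that streaming pattern, shown with sqlite3 and a StringIO standing in for the destination file (with psycopg2 you would additionally use a named, server-side cursor so rows are fetched from Redshift in batches rather than all at once):

```python
import csv
import io
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE big_table (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO big_table VALUES (?, ?)",
                [(1, "alpha"), (2, "beta"), (3, "gamma")])

csv_buf = io.StringIO()  # stand-in for open(dest_filename, "w", newline="")
writer = csv.writer(csv_buf)
writer.writerow(["id", "name"])  # header row
cur.execute("SELECT id, name FROM big_table ORDER BY id")
for record in cur:               # stream rows; never materialize them all in memory
    writer.writerow(record)
```

Only one row is held in Python at a time, so the script's memory footprint no longer depends on the size of the result set.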

Python, converting CSV file to SQL table

I have a CSV file without headers and am trying to create a SQL table from certain columns in the file. I tried the solutions given here: Importing a CSV file into a sqlite3 database table using Python,
but keep getting the error that col1 is not defined. I then tried inserting headers in my CSV file and am still getting a KeyError.
Any help is appreciated! (I am not very familiar with SQL at all)
If the .csv file has no headers, you don't want to use DictReader; DictReader assumes line 1 is a set of headers and uses them as keys for every subsequent line. This is probably why you're getting KeyErrors.
A modified version of the example from that link:
import csv, sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (col1, col2);")

with open('data.csv', newline='') as fin:  # 'rb' was for Python 2; use text mode with newline='' in Python 3
    dr = csv.reader(fin)
    dicts = ({'col1': line[0], 'col2': line[1]} for line in dr)
    to_db = ((i['col1'], i['col2']) for i in dicts)
    cur.executemany("INSERT INTO t (col1, col2) VALUES (?, ?);", to_db)
con.commit()
The code below reads all the CSV files under a path and loads their data into an existing table. Note that LOAD DATA LOCAL INFILE is MySQL syntax (used here through mysql-connector-python); SQLite has no equivalent statement.
import glob
import mysql.connector

cnx = mysql.connector.connect(user='user', host='localhost', password='password',
                              database='dbname')
cursor = cnx.cursor()
path = 'path/to/csvs'
for csv_file in glob.glob(path + "/*.csv"):
    add_csv_file = """LOAD DATA LOCAL INFILE '%s' INTO TABLE tablename
        FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES;""" % (csv_file,)
    print("add_csv_file: %s" % csv_file)
    cursor.execute(add_csv_file)
    cnx.commit()
cursor.close()
cnx.close()
Let me know if this works.
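Since the question is actually about sqlite3, which has no LOAD DATA statement, the equivalent there is to parse each CSV with the csv module and batch-insert the rows. A sketch with an in-memory database and StringIO objects standing in for the files on disk (table and column names are illustrative):

```python
import csv
import io
import sqlite3

# Stand-ins for two CSV files, each with a header line to skip
fake_files = [
    io.StringIO("col1,col2\na,1\nb,2\n"),
    io.StringIO("col1,col2\nc,3\n"),
]

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER)")
for fin in fake_files:  # with real files: open(name, newline="")
    reader = csv.reader(fin)
    next(reader)        # the IGNORE 1 LINES equivalent: skip the header row
    cur.executemany("INSERT INTO t (col1, col2) VALUES (?, ?)", reader)
con.commit()
```

Passing the reader straight to executemany streams the file row by row, so even large CSVs never have to fit in memory.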
