On my server I'm trying to read from a bunch of sqlite3 databases (uploaded by web clients) and process their data. The db files are in an S3 bucket; I have their URLs and I can load them into memory.
The problem is that sqlite3.connect() only takes a path string, so I can't pass it a file held in memory.
conn = sqlite3.connect()  # how to pass a file in memory or a URL?
c = conn.cursor()
c.execute('''select * from data;''')
res = c.fetchall()
# other processing with res
SQLite requires database files to be stored on disk (it uses various locks and paging techniques). An in-memory file will not suffice.
I'd create a temporary directory to hold the database file, write it to that directory, then connect to it. The directory gives SQLite the space to write commit logs as well.
To handle all this, a context manager might be helpful:
import os.path
import shutil
import sqlite3
import sys
import tempfile
from contextlib import contextmanager
@contextmanager
def sqlite_database(inmemory_data):
    # Write the in-memory data to a database file in a fresh temp directory.
    path = tempfile.mkdtemp()
    with open(os.path.join(path, 'sqlite.db'), 'wb') as dbfile:
        dbfile.write(inmemory_data)
    conn = None
    try:
        conn = sqlite3.connect(os.path.join(path, 'sqlite.db'))
        yield conn
    finally:
        if conn is not None:
            conn.close()
        # Remove the temp directory, the database, and any journal files.
        try:
            shutil.rmtree(path)
        except IOError:
            sys.stderr.write('Failed to clean up temp dir {}'.format(path))
and use that as:
with sqlite_database(yourdata) as connection:
    # query the database
This writes in-memory data to disk, opens a connection, lets you use that connection, and afterwards cleans up after you.
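For example, a minimal sketch of pulling one of the uploaded databases out of S3 and querying it (assuming boto3 and placeholder bucket/key names; if you only have presigned URLs, any HTTP client that gives you the raw bytes works just as well):

import boto3

s3 = boto3.client('s3')
# Hypothetical bucket and key, for illustration only.
obj = s3.get_object(Bucket='client-uploads', Key='dbs/client1.sqlite')
dbdata = obj['Body'].read()  # raw bytes of the uploaded SQLite file

with sqlite_database(dbdata) as connection:
    c = connection.cursor()
    c.execute('select * from data;')
    res = c.fetchall()
    # other processing with res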
Related
I have a CSV file containing schema and table names (format shared below). My task is to unload data from Redshift to an S3 bucket as CSV files. For this I have the Python script below and two IAM roles: the first to unload data from Redshift, the second to write the data to the S3 bucket. The issue I am facing is that with the script below I can create a folder in my S3 bucket, but instead of a CSV file the file type shown in S3 is "-". I am not sure what the possible reason is.
Any help is much appreciated. Thanks in advance for your time and effort!
Note: I have millions of rows to unload from Redshift to the S3 bucket.
CSV File containing schema and table name
Schema;tables
mmy_schema;my_table
Python Script
import csv
import redshift_connector
import sys

CSV_FILE = "Tables.csv"
CSV_DELIMITER = ';'
S3_DEST_PATH = "s3://..../"

DB_HOST = "MY HOST"
DB_PORT = 1234
DB_DB = "MYDB"
DB_USER = "MY_READ"
DB_PASSWORD = "MY_PSWD"

# Redshift expects chained IAM roles as a single comma-separated string.
IAM_ROLE = "arn:aws:iam::/redshift-role/unload data,arn:aws::iam::/write in bucket"
def get_tables(path):
    tables = []
    with open(path, 'r') as file:
        csv_reader = csv.reader(file, delimiter=CSV_DELIMITER)
        header = next(csv_reader)
        if header is not None:
            for row in csv_reader:
                tables.append(row)
    return tables
def unload(conn, tables, s3_path):
    cur = conn.cursor()
    for table in tables:
        print(f">{table[0]}.{table[1]}")
        try:
            query = f'''unload ('select * from {table[0]}.{table[1]}')
                        to '{s3_path}/{table[1]}/'
                        iam_role '{IAM_ROLE}'
                        CSV
                        PARALLEL FALSE
                        CLEANPATH;'''
            print("unloading in progress")
            cur.execute(query)
            print("Done.")
        except Exception as e:
            print("Failed to unload")
            print(str(e))
            sys.exit(1)
    cur.close()
def main():
    try:
        conn = redshift_connector.connect(
            host=DB_HOST,
            port=DB_PORT,
            database=DB_DB,
            user=DB_USER,
            password=DB_PASSWORD
        )
        tables = get_tables(CSV_FILE)
        unload(conn, tables, S3_DEST_PATH)
        conn.close()
    except Exception as e:
        print(e)
        sys.exit(1)

if __name__ == "__main__":
    main()
Redshift doesn't add a file type suffix on UNLOAD; the object names just end with the part number. And yes, "parallel off" unloads can still produce multiple files. If these file names are required to end in ".csv", your script will need to issue S3 rename calls (a copy followed by a delete) after the unload. The process should also check how many files were produced and act accordingly.
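For example, a minimal sketch of such a rename pass with boto3 (bucket and prefix are placeholders; S3 has no rename call, so each object is copied to a new key and the original is deleted):

import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'      # placeholder bucket name
prefix = 'my_table/'      # placeholder prefix used in the UNLOAD

# List the unloaded part files and append a .csv suffix to each one.
# (list_objects_v2 returns at most 1000 keys per call; paginate if you expect more.)
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in resp.get('Contents', []):
    key = obj['Key']
    if key.endswith('.csv'):
        continue
    s3.copy_object(Bucket=bucket,
                   CopySource={'Bucket': bucket, 'Key': key},
                   Key=key + '.csv')
    s3.delete_object(Bucket=bucket, Key=key)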
I have a query that generates a CSV file from the data in a Postgres table. The script is working fine.
But I have a situation where I need to create separate files using the data from different tables.
Basically only the hardcoded values below change and the rest of the code is the same, so at the moment I would have to create a separate script for every CSV.
Is there a way I can have one script and only change these parameters?
I'm using Jenkins to automate the CSV file creation.
filePath = '/home/jenkins/data/'
fileName = 'data.csv'
import csv
import os
import psycopg2
from pprint import pprint
from datetime import datetime
from utils.config import Configuration as Config
from utils.postgres_helper import get_connection
from utils.utils import get_global_config

# File path and name.
filePath = '/home/jenkins/data/'
fileName = 'data.csv'

# Database connection variable.
connect = None

# Check if the file path exists.
if os.path.exists(filePath):
    try:
        # Connect to database.
        connect = get_connection(get_global_config(), 'dwh')
    except psycopg2.DatabaseError as e:
        # Confirm unsuccessful connection and stop program execution.
        print("Database connection unsuccessful.")
        quit()

    # Cursor to execute query.
    cursor = connect.cursor()

    # SQL to select data from the google feed table.
    sqlSelect = "SELECT * FROM data"

    try:
        # Execute query.
        cursor.execute(sqlSelect)

        # Fetch the data returned.
        results = cursor.fetchall()

        # Extract the table headers.
        headers = [i[0] for i in cursor.description]

        # Open CSV file for writing.
        csvFile = csv.writer(open(filePath + fileName, 'w', newline=''),
                             delimiter=',', lineterminator='\r\n',
                             quoting=csv.QUOTE_ALL, escapechar='\\')

        # Add the headers and data to the CSV file.
        csvFile.writerow(headers)
        csvFile.writerows(results)

        # Message stating export successful.
        print("Data export successful.")
        print('CSV Path : ' + filePath + fileName)
    except psycopg2.DatabaseError as e:
        # Message stating export unsuccessful.
        print("Data export unsuccessful.")
        quit()
    finally:
        # Close database connection.
        connect.close()
else:
    # Message stating file path does not exist.
    print("File path does not exist.")
Code:
import mysql.connector
import sys

def write_file(data, filename):
    with open(filename, 'wb') as f:
        f.write(data)

sampleNum = 0

db_config = mysql.connector.connect(user='root', password='test',
                                    host='localhost',
                                    database='technical')

# query the BLOB data from the document_control table
cursor = db_config.cursor()

try:
    sampleNum = sampleNum + 1
    query = "SELECT fileAttachment FROM document_control WHERE id=%s"
    cursor.execute(query, (sampleNum,))
    file = cursor.fetchone()[0]
    write_file(file, 'User' + str(sampleNum) + '.docx')
except AttributeError as e:
    print(e)
finally:
    cursor.close()
What it does
The above code gets a file stored as a BLOB in MySQL and saves it as a .docx file in a folder.
Question
But instead of saving it, I want to view it and then delete it. Am I able to simply open the BLOB in Word without saving it?
If so, how can it be done?
In general, passing binary data like a BLOB entity as a file-like object can be done with the built-in module io, for example:
import io
f = io.BytesIO(data)
# f now can be used anywhere a file-object is expected
But your question really comes down to MS Word's ability to open files that aren't saved anywhere on disk, and I don't think it can do that. Best practice would probably be to generate a temporary file using tempfile, so that you can at least expect the system to clean it up eventually:
import tempfile

with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as f:
    f.write(data)
    print(f.name)
Edit:
In your code in particular, you could try the following to store the data in a temporary file and automatically open it in MS Word:
import tempfile, subprocess

WINWORD_PATH = r'C:\Program Files (x86)\Microsoft Office\Office14\winword.exe'

def open_as_temp_docx(data):
    with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as f:
        f.write(data)
    # Launch Word on the temp file once it has been written and closed.
    subprocess.Popen([WINWORD_PATH, f.name])
cursor = db_config.cursor()
try:
    sampleNum = sampleNum + 1
    query = "SELECT fileAttachment FROM document_control WHERE id=%s"
    cursor.execute(query, (sampleNum,))
    open_as_temp_docx(cursor.fetchone()[0])
except AttributeError as e:
    print(e)
finally:
    cursor.close()
I don't have a Windows machine with MS Word at hand, so I can't test this. The path to winword.exe on your machine may vary, so make sure it is correct.
Edit:
If it is important to delete the file as soon as MS Word closes, the following should work:
import tempfile, subprocess, os

WINWORD_PATH = r'C:\Program Files (x86)\Microsoft Office\Office14\winword.exe'

def open_as_temp_docx(data):
    with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as f:
        f.write(data)
    # Wait for Word to exit, then remove the temp file.
    subprocess.Popen([WINWORD_PATH, f.name]).wait()
    if os.path.exists(f.name):
        os.unlink(f.name)
I have a requirement to upload files to MongoDB. Currently I am saving files in a folder on the local filesystem using Flask. Is there a way I can store a file in MongoDB without using GridFS? I believe I did something like this long ago, but I cannot recall how, since it's been a long time since I last used MongoDB.
No file I select to upload is more than 16 MB in size.
Update: I tried the following to store an image file using binData, but it throws the error "global name binData is not defined".
import pymongo
import base64
import bson

# establish a connection to the database
connection = pymongo.MongoClient()

# get a handle to the test database
db = connection.test
file_meta = db.file_meta
file_used = "Headshot.jpg"

def main():
    coll = db.sample
    with open(file_used, "r") as fin:
        f = fin.read()
        encoded = binData(f)
        coll.insert({"filename": file_used, "file": f, "description": "test"})
Mongo BSON (https://docs.mongodb.com/manual/reference/bson-types/) has a binary data (binData) type for fields.
The Python driver (http://api.mongodb.com/python/current/api/bson/binary.html) supports it.
You can store the file as an array of bytes.
Your code needs to be slightly modified:
Add the import: from bson.binary import Binary
Encode the file bytes using Binary: encoded = Binary(f)
Use the encoded value in the insert statement.
Full example below:
import pymongo
import base64
import bson
from bson.binary import Binary

# establish a connection to the database
connection = pymongo.MongoClient()

# get a handle to the test database
db = connection.test
file_meta = db.file_meta
file_used = "Headshot.jpg"

def main():
    coll = db.sample
    # read the file in binary mode and wrap its bytes in a BSON Binary value
    with open(file_used, "rb") as f:
        encoded = Binary(f.read())
        # insert_one is the current pymongo API (insert was removed in pymongo 4)
        coll.insert_one({"filename": file_used, "file": encoded, "description": "test"})
Given an sqlite3 connection object, how can I retrieve the file path to the sqlite3 database file?
The Python connection object doesn't store this information.
You could store the path before you open the connection:
path = '/path/to/database/file.db'
conn = sqlite3.connect(path)
or you can ask the database itself what connections it has, using the database_list pragma:
for id_, name, filename in conn.execute('PRAGMA database_list'):
    if name == 'main' and filename is not None:
        path = filename
        break
If you used a connection URI (connecting with the sqlite3.connect() parameter uri=True), the filename will not include the URI parameters or the file:// prefix.
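For instance, a small sketch of the URI case (reusing the placeholder path above):

conn = sqlite3.connect('file:/path/to/database/file.db?mode=ro', uri=True)
row = conn.execute('PRAGMA database_list').fetchone()
print(row[2])  # '/path/to/database/file.db' -- no 'file:' prefix or '?mode=ro'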
We can use the PRAGMA database_list command.
cur = con.cursor()
cur.execute("PRAGMA database_list")
rows = cur.fetchall()

for row in rows:
    print(row[0], row[1], row[2])
The third column (row[2]) is the file name of the database.
Note that there can be more databases attached to the SQLite engine.
$ ./list_dbs.py
0 main /home/user/dbs/test.db
2 movies /home/user/dbs/movies.db
The above is sample output from a script containing this Python code.
Building on Martijn Pieters' answer: unless hardcoding the path is a must, you should do this:
import os, sqlite3

path = os.path.dirname(os.path.abspath(__file__))
db = os.path.join(path, 'file.db')
conn = sqlite3.connect(db)