I have a query that generates a CSV file from the data in a Postgres table, and the script is working fine.
But now I need to create separate files using the data from different tables.
Basically, only the hardcoded values below change and the rest of the code stays the same, so at the moment I would have to create a separate script for every CSV.
Is there a way to have one script and only change these parameters?
I'm using Jenkins to automate the CSV file creation.
filePath = '/home/jenkins/data/'
fileName = 'data.csv'
import csv
import os
import psycopg2
from pprint import pprint
from datetime import datetime
from utils.config import Configuration as Config
from utils.postgres_helper import get_connection
from utils.utils import get_global_config

# File path and name.
filePath = '/home/jenkins/data/'
fileName = 'data.csv'

# Database connection variable.
connect = None

# Check if the file path exists.
if os.path.exists(filePath):
    try:
        # Connect to database.
        connect = get_connection(get_global_config(), 'dwh')
    except psycopg2.DatabaseError as e:
        # Confirm unsuccessful connection and stop program execution.
        print("Database connection unsuccessful.")
        quit()

    # Cursor to execute query.
    cursor = connect.cursor()

    # SQL to select data from the google feed table.
    sqlSelect = "SELECT * FROM data"

    try:
        # Execute query.
        cursor.execute(sqlSelect)

        # Fetch the data returned.
        results = cursor.fetchall()

        # Extract the table headers.
        headers = [i[0] for i in cursor.description]

        # Open CSV file for writing.
        csvFile = csv.writer(open(filePath + fileName, 'w', newline=''),
                             delimiter=',', lineterminator='\r\n',
                             quoting=csv.QUOTE_ALL, escapechar='\\')

        # Add the headers and data to the CSV file.
        csvFile.writerow(headers)
        csvFile.writerows(results)

        # Message stating export successful.
        print("Data export successful.")
        print('CSV Path : ' + filePath + fileName)
    except psycopg2.DatabaseError as e:
        # Message stating export unsuccessful.
        print("Data export unsuccessful.")
        quit()
    finally:
        # Close database connection.
        connect.close()
else:
    # Message stating file path does not exist.
    print("File path does not exist.")
I have a CSV file containing schema and table names (format shared below). My task is to unload data from Redshift to an S3 bucket as CSV files. For this task I have the Python script below and two IAM roles: the first to unload data from Redshift, the second to write the data to the S3 bucket. The issue I am facing is that, using the script below, I am able to create a folder in my S3 bucket, but instead of a CSV file the object type in the S3 bucket is "-". I am not sure what the possible reason is.
Any help is much appreciated. Thanks in advance for your time and effort!
Note: I have millions of rows to unload from Redshift to the S3 bucket.
CSV File containing schema and table name
Schema;tables
mmy_schema;my_table
Python Script
import csv
import sys

import redshift_connector

CSV_FILE = "Tables.csv"
CSV_DELIMITER = ';'
S3_DEST_PATH = "s3://..../"

DB_HOST = "MY HOST"
DB_PORT = 1234
DB_DB = "MYDB"
DB_USER = "MY_READ"
DB_PASSWORD = "MY_PSWD"

# Both roles chained in a single comma-separated string, as UNLOAD's iam_role clause expects.
IAM_ROLE = "arn:aws:iam::/redshift-role/unload data,arn:aws::iam::/write in bucket"


def get_tables(path):
    tables = []
    with open(path, 'r') as file:
        csv_reader = csv.reader(file, delimiter=CSV_DELIMITER)
        header = next(csv_reader)
        if header is not None:
            for row in csv_reader:
                tables.append(row)
    return tables


def unload(conn, tables, s3_path):
    cur = conn.cursor()
    for table in tables:
        print(f">{table[0]}.{table[1]}")
        try:
            query = f'''unload ('select * from {table[0]}.{table[1]}')
                        to '{s3_path}/{table[1]}/'
                        iam_role '{IAM_ROLE}'
                        CSV
                        PARALLEL FALSE
                        CLEANPATH;'''
            print("loading in progress")
            cur.execute(query)
            print("Done.")
        except Exception as e:
            print("Failed to load")
            print(str(e))
            sys.exit(1)
    cur.close()


def main():
    try:
        conn = redshift_connector.connect(
            host=DB_HOST,
            port=DB_PORT,
            database=DB_DB,
            user=DB_USER,
            password=DB_PASSWORD
        )
        tables = get_tables(CSV_FILE)
        unload(conn, tables, S3_DEST_PATH)
        conn.close()
    except Exception as e:
        print(e)
        sys.exit(1)


if __name__ == '__main__':
    main()
Redshift doesn't add file type suffixes on UNLOAD; the object names just end with the part number. And yes, "parallel off" unloads can still produce multiple files. If these file names are required to end in ".csv", your script will need to issue the S3 rename calls (copy and delete) itself. That process should also check the number of files produced and take appropriate action if needed.
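A minimal sketch of that rename step with boto3, assuming the bucket name and prefix shown are placeholders for your own: list the unloaded objects under the table's prefix, copy each one to a key ending in .csv, then delete the original.

import boto3

s3 = boto3.client('s3')


def add_csv_suffix(bucket, prefix):
    """Rename every object under prefix to end in .csv.
    S3 has no real rename, so this is a copy followed by a delete."""
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            if key.endswith('.csv'):
                continue  # already renamed
            s3.copy_object(Bucket=bucket,
                           CopySource={'Bucket': bucket, 'Key': key},
                           Key=key + '.csv')
            s3.delete_object(Bucket=bucket, Key=key)


# Example: called once per table after the UNLOAD finishes.
# add_csv_suffix('my-bucket', 'unload/my_table/')

Note that copy_object only handles objects up to 5 GB; with millions of rows and PARALLEL FALSE, a single unloaded part could exceed that, in which case boto3's managed copy() transfer (which performs a multipart copy) would be needed instead.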
Code:
import mysql.connector
import sys


def write_file(data, filename):
    with open(filename, 'wb') as f:
        f.write(data)


sampleNum = 0

db_config = mysql.connector.connect(user='root', password='test',
                                    host='localhost',
                                    database='technical')

# Query BLOB data from the document_control table.
cursor = db_config.cursor()
try:
    sampleNum = sampleNum + 1
    query = "SELECT fileAttachment FROM document_control WHERE id=%s"
    cursor.execute(query, (sampleNum,))
    file = cursor.fetchone()[0]
    write_file(file, 'User' + str(sampleNum) + '.docx')
except AttributeError as e:
    print(e)
finally:
    cursor.close()
What it does
The above code gets a file stored as a BLOB in MySQL and saves it as a .docx file into a folder.
Question
Instead of saving it, I want to view it and then delete it. Am I able to simply open the BLOB in Word without saving it?
If so, how can it be done?
In general, passing binary data like a BLOB entity as a file-like object can be done with the built-in module io, for example:
import io
f = io.BytesIO(data)
# f now can be used anywhere a file-object is expected
But your question actually comes more down to MS Word's ability to open files that aren't saved anywhere on the disk. I don't think it can do that. Best practice would probably be to generate a temporary file using tempfile, so that you can at least expect the system to clean it up eventually:
import tempfile

with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as f:
    f.write(data)
print(f.name)
Edit:
In your code in particular, you could try the following to store the data in a temporary file and automatically open it in MS Word:
import tempfile, subprocess

WINWORD_PATH = r'C:\Program Files (x86)\Microsoft Office\Office14\winword.exe'


def open_as_temp_docx(data):
    with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as f:
        f.write(data)
    # Open the file once it is closed; delete=False keeps it on disk.
    subprocess.Popen([WINWORD_PATH, f.name])


cursor = db_config.cursor()
try:
    sampleNum = sampleNum + 1
    query = "SELECT fileAttachment FROM document_control WHERE id=%s"
    cursor.execute(query, (sampleNum,))
    open_as_temp_docx(cursor.fetchone()[0])
except AttributeError as e:
    print(e)
finally:
    cursor.close()
I don't have a Windows machine with MS Word at hand, so I can't test this. The path to winword.exe on your machine may vary, so make sure it is correct.
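As a side note, if hard-coding the winword.exe path is a concern, Windows can pick the registered application itself. A small variation, assuming the script runs on Windows:

import os
import tempfile


def open_as_temp_docx(data):
    with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as f:
        f.write(data)
    # Let Windows open the file with whatever application is registered for .docx.
    os.startfile(f.name)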
Edit:
If it is important to delete the file as soon as MS Word closes, the following should work:
import tempfile, subprocess, os

WINWORD_PATH = r'C:\Program Files (x86)\Microsoft Office\Office14\winword.exe'


def open_as_temp_docx(data):
    with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as f:
        f.write(data)
    # Wait for Word to exit, then remove the temporary file.
    subprocess.Popen([WINWORD_PATH, f.name]).wait()
    if os.path.exists(f.name):
        os.unlink(f.name)
I have the following problem: I want to extract data from HDFS (a table called 'complaint'). I wrote the following script, which actually works:
import pandas as pd
from hdfs import InsecureClient
import os

file = open("test.txt", "wb")

print("Step 1")
client_hdfs = InsecureClient('http://XYZ')
N = 10

print("Step 2")
with client_hdfs.read('/user/.../complaint/000000_0') as reader:
    print('new line')
    features = reader.read(1000000)
    file.write(features)

print('end')
file.close()
My problem now is that the folder "complaint" contains 4 files (I don't know which file type), and the read operation gives me back bytes which I can't use any further (I saved them to a text file as a test and the content looks garbled).
My question now is:
Is it possible to get the data separated by column in a sensible way?
I only found solutions for .csv files and the like, and I'm somewhat stuck here... :-)
EDIT
I made changes to my solution and tried different approaches, but none of them really works. Here's the updated code:
import pandas as pd
from hdfs import InsecureClient
import os
import pypyodbc
import pyspark
from pyspark import SparkConf, SparkContext
from hdfs3 import HDFileSystem
import pyarrow.parquet as pq
import pyarrow as pa
from pyhive import hive

# Step 0: Configurations

# Connections with InsecureClient (this basically works)
# Note: TMS1 doesn't work because of txt files
#insec_client_tms1 = InsecureClient('http://some-adress:50070')
insec_client_tms2 = InsecureClient('http://some-adress:50070')

# Connection with Spark (not working at the moment)
# Error: Java gateway process exited before sending its port number
#conf = SparkConf().setAppName('TMS').setMaster('spark://adress-of-node:7077')
#sc = SparkContext(conf=conf)

# Connection via PyArrow (not working)
# Error: File not found
#fs = pa.hdfs.connect(host='hdfs://node-adress', port=8020)
#print("FS: " + fs)

# Connection via HDFS3 (not working)
# The module couldn't be loaded
#client_hdfs = HDFileSystem(host='hdfs://node-adress', port=8020)

# Connection via Hive (not working)
# No module named sasl -> I tried to install it, but it also fails
#conn = hive.Connection(host='hdfs://node-adress', port=8020, database='deltatest')

# Step 1: Extraction
print("starting Extraction")

# Create file
file = open("extraction.txt", "w")

# Extraction with Spark
#text = sc.textFile('/user/hive/warehouse/XYZ.db/baseorder_flags/000000_0')
#first_line = text.first()
#print(first_line)

# Extraction with Hive
#df = pd.read_sql('select * from baseorder', conn)
#print("DF: " + df)

# Extraction with hdfs3
#with client_hdfs.open('/home/deltatest/basedeviation/000000_0') as f:
#    df = pd.read_parquet(f)

# Extraction with the web client (not working)
# Error: Arrow error: IOError: seek -> fastparquet has a similar error
with insec_client_tms2.read('/home/deltatest/basedeviation/000000_0') as reader:
    features = pd.read_parquet(reader)
    print(features)
    #features = reader.read()
    #data = features.decode('utf-8', 'replace')

print("saving data to file")
file.write(str(features))
print('end')
file.close()
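Since the files under the Hive warehouse path appear to be Parquet (both pyarrow and fastparquet complain about a failing seek rather than an unreadable format), one workaround is to buffer the whole file into memory first so the Parquet reader gets a seekable object. A minimal sketch, assuming the files really are Parquet and the host/path above are placeholders for your own:

import io

import pandas as pd
from hdfs import InsecureClient

client = InsecureClient('http://some-adress:50070')

# Read the raw bytes from HDFS and wrap them in a seekable in-memory buffer;
# pd.read_parquet (via pyarrow) needs to seek, which the WebHDFS reader can't do.
with client.read('/home/deltatest/basedeviation/000000_0') as reader:
    buffer = io.BytesIO(reader.read())

df = pd.read_parquet(buffer)   # one DataFrame column per field of the Hive table
print(df.head())
df.to_csv('extraction.csv', index=False)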
I'm working on a Python Script that runs queries against an Oracle Database and saves the results to csv. The greater plan is to use regular extracts with a separate application to check differences in the files through hashing.
The issue I've run into is that my script has so far saved some fields in the extracts in different formats. For example, it saves a field as an integer in one extract and as a float in the next, and a date as 2000/01/01 in one and 2000-01-01 in another.
These changes are giving my difference check script a fit. What can I do to ensure that every extract is saved the same way, while keeping the script generic enough to run arbitrary queries?
import sys
import traceback
import cx_Oracle
from Creds import creds
import csv
import argparse
import datetime


try:
    conn = cx_Oracle.connect(
        '{username}/{password}@{address}/{dbname}'.format(**creds)
    )
except cx_Oracle.Error as e:
    print('Unable to connect to database.')
    print()
    print(''.join(traceback.format_exception(*sys.exc_info())), file=sys.stderr)
    sys.exit(1)


def run_extract(query, out):
    """
    Run the given query and save results to given out path.

    :param query: Query to be executed.
    :param out: Path to store results in csv.
    """
    cur = conn.cursor()
    try:
        cur.execute(query)
    except cx_Oracle.DatabaseError as e:
        print('Unable to run given query.')
        print()
        print(query)
        print()
        print(''.join(traceback.format_exception(*sys.exc_info())), file=sys.stderr)
        sys.exit(1)
    with open(out, 'w', newline='') as out_file:
        wrt = csv.writer(out_file)
        header = []
        for column in cur.description:
            header.append(column[0])
        wrt.writerow(header)
        for row in cur:
            wrt.writerow(row)
    cur.close()


def read_sql(file_path):
    """
    Read the SQL from a given filepath and return it as a string.

    :param file_path: File path location of the file to read.
    """
    try:
        with open(file_path, 'r') as file:
            return file.read()
    except FileNotFoundError as e:
        print('File not found at given path.')
        print()
        print(file_path)
        print()
        print(''.join(traceback.format_exception(*sys.exc_info())), file=sys.stderr)
        sys.exit(1)


def generate_timestamp_path(path):
    """
    Add a timestamp to the beginning of the file name in the given path.

    :param path: File path for the timestamp to be added to.
    """
    stamp = datetime.datetime.now().strftime('%Y%m%dT%H%M%S ')
    if '/' in path:
        sep = '/'
    elif '\\' in path:
        sep = '\\'
    else:
        # No directory component; just prefix the file name.
        return stamp + path
    parts = path.split(sep)
    parts[-1] = stamp + parts[-1]
    return sep.join(parts)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    in_group = parser.add_mutually_exclusive_group()
    in_group.add_argument('-q', '--query', help='String of the query to run.')
    in_group.add_argument('-f', '--in_file', help='File of the query to run.')
    parser.add_argument('-o', '--out_file', help='Path to file to store.')
    parser.add_argument('-t', '--timestamp',
                        help='Store the file with a preceding timestamp.',
                        action='store_true')
    args = parser.parse_args()

    if not args.out_file:
        print('Please provide a path to put the query results with -o.')
        sys.exit(1)
    if args.timestamp:
        path = generate_timestamp_path(args.out_file)
    else:
        path = args.out_file
    if args.query:
        query = args.query
    elif args.in_file:
        query = read_sql(args.in_file)
    else:
        print('Please provide either a query string with -q',
              'or a SQL file with -f.')
        sys.exit(1)
    run_extract(query, path)
Your code is simply using the default transformations for all data types. Note that an Oracle type of number(9) will be returned as an integer but number by itself will be returned as a float. You may wish to use an outputtypehandler in order to place this under your control a bit more firmly. :-) Examples for doing so are in the cx_Oracle distribution samples directory (ReturnLongs and ReturnUnicode). All of that said, though, the same data definition in Oracle will always return the same type in Python -- so I would suspect that you are referencing different data types in Oracle that you would prefer to see processed the same way.
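A minimal sketch of an output type handler along those lines, using cx_Oracle's documented handler signature: it fetches every NUMBER column as a Decimal so values don't flip between int and float, and a small helper formats dates explicitly before each CSV row is written. The commented lines show where it would slot into run_extract above; the names are only illustrative.

import datetime
import decimal

import cx_Oracle


def consistent_types(cursor, name, default_type, size, precision, scale):
    # Fetch all NUMBER columns as Decimal so 1 and 1.0 render identically across extracts.
    if default_type == cx_Oracle.NUMBER:
        return cursor.var(decimal.Decimal, arraysize=cursor.arraysize)


def format_value(value):
    # Render dates and timestamps with one fixed format before they reach the CSV writer.
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.strftime('%Y-%m-%d %H:%M:%S')
    return value


# conn.outputtypehandler = consistent_types
# ...
# for row in cur:
#     wrt.writerow([format_value(v) for v in row])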
ISSUE RESOLVED: It turns out that I never actually had an issue in the first place. When I did a count on the number of records to determine how many I should expect to be imported, blank lines between the .json objects were being counted towards the total record count. Upon importing, however, only the objects with content were imported. I'll just leave this post here for reference anyway. Thank you to those who contributed regardless.
I have around 33 GB of .JSON files, retrieved from Twitter's streaming API, stored in a local directory. I am trying to import this data into a MongoDB collection. I have made two attempts:
First attempt: read through each file individually (~70 files). This successfully imported 11,171,885 / 22,343,770 documents.
import json
import glob
from pymongo import MongoClient

directory = '/data/twitter/output/*.json'
client = MongoClient("localhost", 27017)
db = client.twitter
collection = db.test

jsonFiles = glob.glob(directory)
for file in jsonFiles:
    f = open(file, 'r')
    for line in f.read().split("\n"):
        if line:
            try:
                lineJson = json.loads(line)
            except (ValueError, KeyError, TypeError) as e:
                pass
            else:
                postid = collection.insert(lineJson)
                print('inserted with id: ', postid)
    f.close()
Second attempt: concatenate each .JSON file into one large file. This successfully imported 11,171,879 / 22,343,770 documents.
import json
import os
from pymongo import MongoClient
import sys

client = MongoClient("localhost", 27017)
db = client.tweets
collection = db.test

script_dir = os.path.dirname(__file__)
file_path = os.path.join(script_dir, '/data/twitter/blob/historical-tweets.json')
try:
    with open(file_path, 'r') as f:
        for line in f.read().split("\n"):
            if line:
                try:
                    lineJson = json.loads(line)
                except (ValueError, KeyError, TypeError) as e:
                    pass
                else:
                    postid = collection.insert(lineJson)
                    print('inserted with id: ', postid)
except IOError as e:
    print(e)
    sys.exit(1)
The Python script did not error out or print a traceback; it simply stopped running. Any ideas what could be causing this? Or any alternative solutions for importing the data more efficiently? Thanks in advance.
You are reading the file one line at a time. Is each line really valid JSON? If not, json.loads will raise an exception, and you are hiding that exception with the pass statement.
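A minimal sketch of how to surface those hidden failures instead of silently passing, reusing the file and collection names from the question; the counter and log file name are only illustrative:

import json

failed = 0
with open('parse_failures.log', 'w') as log:
    with open(file, 'r') as f:
        for line_number, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                lineJson = json.loads(line)
            except ValueError as e:
                # Record the failure instead of swallowing it with pass.
                failed += 1
                log.write('line {}: {}\n'.format(line_number, e))
            else:
                collection.insert(lineJson)

print('failed to parse {} lines'.format(failed))

Iterating over the file object line by line also avoids loading an entire multi-gigabyte file into memory the way f.read().split("\n") does, which matters with 33 GB of input.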