Python cx_Oracle and CSV extracts saved differently with different executions

I'm working on a Python script that runs queries against an Oracle database and saves the results to CSV. The larger plan is to take regular extracts and use a separate application to check for differences between the files through hashing.
The issue I've run into is that my script has so far saved some fields in the extracts in different formats. For example, saving a field as an integer in one extract and as a float in the next, or saving a date as 2000/01/01 in one and 2000-01-01 in another.
These changes are giving my difference-check script fits. What can I do to ensure that every extract is saved the same way, while keeping the script generic enough to run arbitrary queries?
import sys
import traceback
import cx_Oracle
from Creds import creds
import csv
import argparse
import datetime


try:
    conn = cx_Oracle.connect(
        '{username}/{password}@{address}/{dbname}'.format(**creds)
    )
except cx_Oracle.Error as e:
    print('Unable to connect to database.')
    print()
    print(''.join(traceback.format_exception(*sys.exc_info())), file=sys.stderr)
    sys.exit(1)


def run_extract(query, out):
    """
    Run the given query and save results to given out path.

    :param query: Query to be executed.
    :param out: Path to store results in csv.
    """
    cur = conn.cursor()
    try:
        cur.execute(query)
    except cx_Oracle.DatabaseError as e:
        print('Unable to run given query.')
        print()
        print(query)
        print()
        print(''.join(traceback.format_exception(*sys.exc_info())), file=sys.stderr)
        sys.exit(1)
    with open(out, 'w', newline='') as out_file:
        wrt = csv.writer(out_file)
        # Write the header row, then stream each result row as returned
        # by the driver.
        header = []
        for column in cur.description:
            header.append(column[0])
        wrt.writerow(header)
        for row in cur:
            wrt.writerow(row)
    cur.close()


def read_sql(file_path):
    """
    Read the SQL from a given filepath and return as a string.

    :param file_path: File path location of the file to read.
    """
    try:
        with open(file_path, 'r') as file:
            return file.read()
    except FileNotFoundError as e:
        print('File not found at given path.')
        print()
        print(file_path)
        print()
        print(''.join(traceback.format_exception(*sys.exc_info())), file=sys.stderr)
        sys.exit(1)


def generate_timestamp_path(path):
    """
    Add a timestamp to the beginning of the file name in the given path.

    :param path: File path for the timestamp to be added to.
    """
    stamp = datetime.datetime.now().strftime('%Y%m%dT%H%M%S ')
    if '/' in path:
        sep = '/'
    elif '\\' in path:
        sep = '\\'
    else:
        # Bare file name with no directory component.
        return stamp + path
    parts = path.split(sep)
    parts[-1] = stamp + parts[-1]
    return sep.join(parts)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    in_group = parser.add_mutually_exclusive_group()
    in_group.add_argument('-q', '--query', help='String of the query to run.')
    in_group.add_argument('-f', '--in_file', help='File of the query to run.')
    parser.add_argument('-o', '--out_file', help='Path to file to store.')
    parser.add_argument('-t', '--timestamp',
                        help='Store the file with a preceding timestamp.',
                        action='store_true')
    args = parser.parse_args()

    if not args.out_file:
        print('Please provide a path to put the query results with -o.')
        sys.exit(1)

    if args.timestamp:
        path = generate_timestamp_path(args.out_file)
    else:
        path = args.out_file

    if args.query:
        query = args.query
    elif args.in_file:
        query = read_sql(args.in_file)
    else:
        print('Please provide either a query string with -q',
              'or a SQL file with -f.')
        sys.exit(1)

    run_extract(query, path)

Your code is simply using the default transformations for all data types. Note that an Oracle type of number(9) will be returned as an integer, but number by itself will be returned as a float. You may wish to use an outputtypehandler in order to place this under your control a bit more firmly. :-) Examples for doing so are in the cx_Oracle distribution's samples directory (ReturnLongs and ReturnUnicode). All of that said, though, the same data definition in Oracle will always return the same type in Python -- so I would suspect that you are querying different data types in Oracle that you would prefer to see processed the same way.
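For illustration, a minimal sketch of such an output type handler (the handler name and the 255-byte buffer size are arbitrary choices, not from the original post). It asks the driver to return every NUMBER and DATE/TIMESTAMP column as a string, so 1 versus 1.0, or two different date renderings, can no longer diverge between extracts:

def normalize_types(cursor, name, default_type, size, precision, scale):
    # Fetch NUMBER and DATE/TIMESTAMP columns as strings instead of the
    # default int/float/datetime conversions.
    if default_type in (cx_Oracle.NUMBER, cx_Oracle.DATETIME, cx_Oracle.TIMESTAMP):
        return cursor.var(str, 255, arraysize=cursor.arraysize)

# Applies to every cursor created from this connection, including the one
# used inside run_extract().
conn.outputtypehandler = normalize_types

Date strings still follow the session's NLS_DATE_FORMAT, so if that might vary between runs you may also want to pin it with an ALTER SESSION statement, or format the values in Python before writing the CSV.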

Related

Tips for increasing performance for bson.json_util dump function

I have a service that reads from mongo and needs to dump all the records with the same metadata_id into a local temp file. Is there a way to optimize/speed up the bson.json_util dumping portion?
The querying part, where everything is loaded into the cursor, always takes less than 30 seconds for hundreds of MBs, but the dumping part then takes around 1 hour.
It took 3 days to archive ~0.2 TB of data.
def dump_snapshot_to_local_file(mongo, database, collection, metadata_id, file_path, dry_run=False):
    """
    Creates a gz archive for all documents with the same metadata_id
    """
    cursor = mongo_find(mongo.client[database][collection], match={"metadata_id": ObjectId(metadata_id)})

    path = file_path + '/' + database + '/' + collection + '/'
    create_directory(path)
    path = path + metadata_id + '.json.gz'

    ok = False
    try:
        with gzip.open(path, 'wb') as file:
            logging.info("Saving to temp location %s", path)
            file.write(b'{"documents":[')
            for document in cursor:
                if ok:
                    file.write(b',')
                ok = True
                file.write(dumps(document).encode())
            file.write(b']}')
    except IOError as e:
        logging.error("Failed exporting data with metadata_id %s to gz. Error: %s", metadata_id, e.strerror)
        return False

    if not is_gz_file(path):
        logging.error("Failed to create gzip file for data with metadata_id %s", metadata_id)
        return False

    logging.info("Data with metadata_id %s was successfully saved at temp location", metadata_id)
    return True
Would there be a better approach to do this?
Any tip would be greatly appreciated.
Since I wasn't using any of the JSONOptions functionality, and the service was spending most of its time doing the json_util dumps, stepping away from it and dumping directly to BSON, without the JSON conversion, saved 35 minutes off the original 40-minute load (1.8 million documents, ~3.5 GB):
try:
    with gzip.open(path, 'wb') as file:
        logging.info("Saving snapshot to temp location %s", path)
        for document in cursor:
            file.write(bson.BSON.encode(document))
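For context, a self-contained sketch of that BSON variant (the function name and the error handling are illustrative additions, not from the original service):

import gzip
import logging

import bson  # ships with pymongo


def dump_snapshot_to_bson(cursor, path):
    """Stream every document from the cursor into a gzipped .bson file."""
    try:
        with gzip.open(path, 'wb') as file:
            logging.info("Saving snapshot to temp location %s", path)
            for document in cursor:
                # bson.BSON.encode serializes the dict to raw BSON bytes,
                # skipping the json_util round trip entirely.
                file.write(bson.BSON.encode(document))
    except IOError as e:
        logging.error("Failed exporting snapshot to %s: %s", path, e)
        return False
    return True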

How to generalise the import script

I have a script that generates a CSV file from the data in a Postgres table. The script is working fine.
But I have a situation where I need to create separate files using the data from a different table.
Basically, only the hardcoded values below change; the rest of the code is the same. As it stands, I have to create a separate script for every CSV.
Is there a way I can have one script and only change these parameters?
I'm using Jenkins to automate the CSV file creation.
filePath = '/home/jenkins/data/'
fileName = 'data.csv'
import csv
import os
import psycopg2
from pprint import pprint
from datetime import datetime
from utils.config import Configuration as Config
from utils.postgres_helper import get_connection
from utils.utils import get_global_config

# File path and name.
filePath = '/home/jenkins/data/'
fileName = 'data.csv'

# Database connection variable.
connect = None

# Check if the file path exists.
if os.path.exists(filePath):
    try:
        # Connect to database.
        connect = get_connection(get_global_config(), 'dwh')
    except psycopg2.DatabaseError as e:
        # Confirm unsuccessful connection and stop program execution.
        print("Database connection unsuccessful.")
        quit()

    # Cursor to execute query.
    cursor = connect.cursor()

    # SQL to select data from the google feed table.
    sqlSelect = "SELECT * FROM data"

    try:
        # Execute query.
        cursor.execute(sqlSelect)

        # Fetch the data returned.
        results = cursor.fetchall()

        # Extract the table headers.
        headers = [i[0] for i in cursor.description]

        # Open CSV file for writing.
        csvFile = csv.writer(open(filePath + fileName, 'w', newline=''),
                             delimiter=',', lineterminator='\r\n',
                             quoting=csv.QUOTE_ALL, escapechar='\\')

        # Add the headers and data to the CSV file.
        csvFile.writerow(headers)
        csvFile.writerows(results)

        # Message stating export successful.
        print("Data export successful.")
        print('CSV Path : ' + filePath + fileName)
    except psycopg2.DatabaseError as e:
        # Message stating export unsuccessful.
        print("Data export unsuccessful.")
        quit()
    finally:
        # Close database connection.
        connect.close()
else:
    # Message stating file path does not exist.
    print("File path does not exist.")

python argparse.ArgumentParser read config file

I'm trying to add the switch -c to specify the config file.
I have it working at the moment using config.dat, but when I use -c and specify a new .dat it still uses the default config.dat...
Any idea where I'm going wrong?
#!/usr/bin/python3
import argparse
import shutil

parser = argparse.ArgumentParser(description='Copy multiple Files from a specified data file')
parser.add_argument('-c', '--configfile', default="config.dat",
                    help='file to read the config from')

def read_config(data):
    try:
        dest = '/home/admin/Documents/backup/'
        # Read in data from config.dat
        data = open('config.dat')
        # Iterate through the list of files split on '\n'
        filelist = data.read().split('\n')
        # Copy each file in the list, stripping white space and skipping empty lines
        for file in filelist:
            if file:
                shutil.copy(file.strip(), dest)
    except FileNotFoundError:
        pass

args = parser.parse_args()
read = read_config(args.configfile)
Take a close look at what you are doing on line 14. Even though you are retrieving the --configfile argument and assigning it to args, you are still opening a string literal, data = open('config.dat'), instead of using data (the value of the configfile argument that is passed into the function read_config):
def read_config(data):
    try:
        dest = '/home/admin/Documents/backup/'
        # Read in data from the config file
        data = open(data)
        ...
I would also change the naming of the argument data you are passing to read_config -- it's a bit ambiguous. You know that this function expects a file name as an argument, so why not simply call it filename?
def read_config(filename):
    try:
        dest = '/home/admin/Documents/backup/'
        # Read in data from the config file
        data = open(filename)
        # Iterate through the list of files split on '\n'
        filelist = data.read().split('\n')
        # Copy each file in the list, stripping white space and skipping empty lines
        for file in filelist:
            if file:
                shutil.copy(file.strip(), dest)
    except FileNotFoundError:
        pass
This code works by converting the args to a dictionary, then getting the value via its key. Also, the code you had on line 13 didn't open the passed-in value; this one opens the passed-in file. See if this works for you:
#!/usr/bin/python3
import argparse
import shutil

parser = argparse.ArgumentParser(description='Copy multiple Files from a specified data file')
parser.add_argument('-c', '--configfile', default="config.dat",
                    help='file to read the config from')

def read_config(data):
    try:
        dest = '/home/admin/Documents/backup/'
        # Read in data from the config file
        data = open(data)
        # Iterate through the list of files split on '\n'
        filelist = data.read().split('\n')
        # Copy each file in the list, stripping white space and skipping empty lines
        for file in filelist:
            if file:
                shutil.copy(file.strip(), dest)
    except FileNotFoundError:
        pass

args = vars(parser.parse_args())
read = read_config(args['configfile'])
Make proper use of the function argument; names changed to clarify the nature of the variables.
def read_config(filename='config.dat'):
    try:
        dest = '/home/admin/Documents/backup/'
        afile = open(filename)
        # Iterate through the list of files split on '\n'
        filelist = afile.read().split('\n')
        # Copy each file in the list, stripping white space and skipping empty lines
        for file in filelist:
            if file:
                shutil.copy(file.strip(), dest)
    except FileNotFoundError:
        pass
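For completeness, the call site stays essentially the same as in the other answers (a small usage sketch, reusing the parser defined in the question):

args = parser.parse_args()
read_config(args.configfile)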

Why is this function taking so long to evaluate?

I am attempting to make a function that records a few strings to a logfile on a server. For some reason, this function takes forever to run through... about 20 seconds before it returns the exception. I think it's the try: statement with the file open call.
Any ideas how I can do this correctly?
def writeUserRecord():
    """ Given a path, logs user name, fetch version, and time"""
    global fetchVersion
    global fetchHome

    filename = 'users.log'
    logFile = os.path.normpath(os.path.join(fetchHome, filename))

    timeStamp = str(datetime.datetime.now()).split('.')[0]
    userID = getpass.getuser()

    try:
        file = open(logFile, 'a')
        file.write('{} {} {}'.format(userID, timeStamp, fetchVersion))
        file.close()
    except IOError:
        print('Error Accessing Log File')
        pass
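To confirm whether the open() call is really the slow step, one option is to time it directly. This is only a diagnostic sketch, not from the original post; it slots into writeUserRecord in place of its try block and reuses that function's local names:

import time

start = time.monotonic()
try:
    file = open(logFile, 'a')
    print('open() took {:.2f}s'.format(time.monotonic() - start))
    file.write('{} {} {}'.format(userID, timeStamp, fetchVersion))
    file.close()
except IOError:
    print('open() failed after {:.2f}s'.format(time.monotonic() - start))

If open() alone accounts for the ~20 seconds, the delay is in reaching the server path (for example, a slow or unreachable network mount) rather than in the try statement itself.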

Cannot import all .JSON files into MongoDB

ISSUE RESOLVED: So it turns out that I never actually had an issue in the first place. When I did a count on the number of records to determine how many records I should expect to be imported, blank lines between .json objects were being counted towards the total record count. However, upon importing, only the objects with content were inserted. I'll just leave this post here for reference anyway. Thank you to those who contributed regardless.
I have around 33 GB of .JSON files, retrieved from Twitter's streaming API, stored in a local directory. I am trying to import this data into a MongoDB collection. I have made two attempts:
First attempt: read through each file individually (~70 files). This successfully imported 11,171,885 / 22,343,770 documents.
import json
import glob
from pymongo import MongoClient

directory = '/data/twitter/output/*.json'

client = MongoClient("localhost", 27017)
db = client.twitter
collection = db.test

jsonFiles = glob.glob(directory)

for file in jsonFiles:
    f = open(file, 'r')
    for line in f.read().split("\n"):
        if line:
            try:
                lineJson = json.loads(line)
            except (ValueError, KeyError, TypeError) as e:
                pass
            else:
                postid = collection.insert(lineJson)
                print 'inserted with id: ', postid
    f.close()
Second attempt: concatenate each .JSON file into one large file. This successfully imported 11,171,879 / 22,343,770 documents.
import json
import os
from pymongo import MongoClient
import sys

client = MongoClient("localhost", 27017)
db = client.tweets
collection = db.test

script_dir = os.path.dirname(__file__)
file_path = os.path.join(script_dir, '/data/twitter/blob/historical-tweets.json')

try:
    with open(file_path, 'r') as f:
        for line in f.read().split("\n"):
            if line:
                try:
                    lineJson = json.loads(line)
                except (ValueError, KeyError, TypeError) as e:
                    pass
                else:
                    postid = collection.insert(lineJson)
                    print 'inserted with id: ', postid
        f.close()
The Python script did not error out or print a traceback; it simply stopped running. Any ideas as to what could be causing this? Or any alternative solutions for importing the data more efficiently? Thanks in advance.
You are reading the file one line at a time. Is each line really valid JSON? If not, json.loads will raise, and you are hiding that error with the pass statement.
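A small sketch of making those skipped lines visible instead of silently passing over them (the counter and the logging calls are additions for illustration; file_path and collection are the names from the question's second attempt):

import json
import logging

skipped = 0
with open(file_path, 'r') as f:
    # Iterating the file object line by line also avoids loading the whole
    # concatenated file into memory with f.read().
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            lineJson = json.loads(line)
        except (ValueError, KeyError, TypeError) as e:
            skipped += 1
            logging.warning("Skipping malformed line: %s", e)
        else:
            collection.insert(lineJson)

logging.info("Skipped %d malformed lines", skipped)

Comparing skipped against the expected document count shows immediately whether the "missing" records were ever valid JSON to begin with.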
