Handling Unicode strings pulled from SOQL in Python

The purpose of the code is to query the Salesforce API with SOQL, then format the data and do some processing before putting it into an Oracle database. My code successfully handles the first and third parts, but the second part keeps breaking.
The code runs on Python 2.7 with the standard CPython interpreter on Windows 7.
The SOQL is
SELECT ID, Name, Type, Description, StartDate, EndDate, Status
FROM CAMPAIGN
ORDER BY ID
This query pulls back a few hundred results as a JSON dict.
I have to pull each record (a record contains ID, Name, Type, Description, StartDate, EndDate, and Status) one at a time and pass it to a function that generates the proper SQL to put the data into the proper Oracle database. All of the results of the query come back as Unicode strings.
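For context, the decoded response is shaped roughly like the sketch below (field values are invented for illustration; the 'attributes' entry is Salesforce record metadata, which the cleaning loop further down skips):
# Rough shape of the decoded Salesforce query response (illustrative values only).
SrcData = {
    u'totalSize': 347,
    u'done': True,
    u'records': [
        {
            u'attributes': {u'type': u'Campaign', u'url': u'/services/data/vXX.X/sobjects/Campaign/...'},
            u'Id': u'701...',
            u'Name': u'Spring Mailer',
            u'Type': u'Email',
            u'Description': u'Free-text field that may contain non-ASCII characters\u2026',
            u'StartDate': u'2015-03-01',
            u'EndDate': u'2015-03-31',
            u'Status': u'Completed',
        },
        # ...a few hundred more records...
    ],
}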
The trouble shows up after I query the data, when I pass it to the function that generates the SQL to insert it into the Oracle database.
Here is the section of code where the error occurs.
keys = ['attributes', 'Id', 'Name', 'Type', 'Description', 'StartDate', 'EndDate', 'Status']
for record in SrcData['records']:  # Data cleaning in this loop.
    processedRecs = []
    if record['Description'] is not None:
        record['Description'] = encodeStr(record['Description'])
        record['Description'] = record['Description'][0:253]
    for key in keys:
        if key == 'attributes':
            continue
        elif key == 'StartDate' and record[key] is not None:
            record[key] = datetime.datetime.strptime(record[key], "%Y-%m-%d")
        elif key == 'EndDate' and record[key] is not None:
            record[key] = datetime.datetime.strptime(record[key], "%Y-%m-%d")
        else:
            pass
        processedRecs.append(record[key])
    sqlFile.seek(0)
    Query = RetrieveSQL(sqlFile, processedRecs)
The key list exists because there were issues with looping over SrcData.keys().
The encodeStr function is:
def encodeStr(strToEncode):
    if strToEncode == None:
        return ""
    else:
        try:
            tmpstr = strToEncode.encode('ascii', 'ignore')
            tmpstr = ' '.join(tmpstr.split())
            return tmpstr
        except:
            return str(strToEncode)
The error message I get is:
Traceback (most recent call last):
  File "XXX", line 106, in Query = ASPythonLib.RetrieveSQL(sqlFile, processedRecs)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 31: ordinal not in range(128)
The XXX is just a file path to where this code sits in our file system; my boss said I must remove the path.
I have also tried multiple variations of:
record['Description'] = record['Description'].encode('ascii', 'ignore').decode(encoding='ascii',errors='strict')
I have tried swapping the order of the encode and decode functions. I have tried different codecs and different error handling schemes.
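For reference, here is a minimal Python 2 sketch (not from the original code) of the failure mode the traceback describes: implicitly converting a unicode value that contains u'\u2026' (an ellipsis) with str() triggers an ASCII encode, while an explicit encode lets you choose the behaviour.
# Minimal Python 2 sketch, illustrative only -- not the production code.
desc = u'some text\u2026 more text'       # u'\u2026' is the Unicode ellipsis character

try:
    str(desc)                             # str() implies an ASCII encode of the unicode value
except UnicodeEncodeError as e:
    print e                               # 'ascii' codec can't encode character u'\u2026' in position 9...

print desc.encode('ascii', 'ignore')      # 'some text more text' -- the ellipsis is silently dropped
print desc.encode('utf-8')                # 'some text\xe2\x80\xa6 more text' -- byte string keeps the character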
****Edit**** This code works correctly in about 20 other cycles, so it's safe to assume the error is not in RetrieveSQL().
Here is the code for RetrieveSQL:
def RetrieveSQL(SQLFile, VarList, Log = None):
    SQLQuery = SQLFile.readline()
    FileArgs = [""]
    NumArgValues = len(VarList)
    if( "{}" in SQLQuery ):
        # NumFileArgs == 0
        if (NumArgValues != 0):
            print "Number of File Arguments is zero for File " + str(SQLFile) + " is NOT equal to the number of values provided per argument (" + str(NumArgValues) + ")."
        return SQLFile.read()
    elif( SQLQuery[0] != "{" ):
        print "File " + str(SQLFile) + " is not an SQL source file."
        return -1
    elif( SQLQuery.startswith("{") ):
        FileArgs = SQLQuery.replace("{", "").replace("}", "").split(", ")
        for Arg in xrange(0, len(FileArgs)):
            FileArgs[Arg] = "&" + FileArgs[Arg].replace("\n", "").replace("\t", "") + "&"  # Add &'s for replacing
    NumFileArgs = len(FileArgs)
    if (NumFileArgs != NumArgValues):
        if (NumArgValues == 0):
            print "No values were supplied to RetrieveSQL() for File " + str(SQLFile) + " when there were supposed to be " + str(NumFileArgs) + " values."
            return -1
        elif (NumArgValues > 0):
            print "Number of File Arguments (" + str(NumFileArgs) + ") for File " + str(SQLFile) + " is NOT equal to the number of values provided per argument (" + str(NumArgValues) + ")."
            return -1
    SQLQuery = SQLFile.read()
    VarList = list(VarList)
    for Arg in xrange(0, len(FileArgs)):
        if (VarList[Arg] == None):
            SQLQuery = SQLQuery.replace(FileArgs[Arg], "NULL")
        elif ("'" in str(VarList[Arg])):
            SQLQuery = SQLQuery.replace(FileArgs[Arg], "'" + VarList[Arg].replace("'", "''") + "'")
        elif ("&" in str(VarList[Arg])):
            SQLQuery = SQLQuery.replace(FileArgs[Arg], "'" + VarList[Arg].replace("&", "&'||'") + "'")
        elif (isinstance(VarList[Arg], basestring) == True):
            VarList[Arg] = VarList[Arg].replace("'", "''")
            SQLQuery = SQLQuery.replace(FileArgs[Arg], "'" + VarList[Arg] + "'")
        else:
            SQLQuery = SQLQuery.replace(FileArgs[Arg], str(VarList[Arg]))
    SQLFile.seek(0)
    return SQLQuery
****Edit #2****
I tried finding a complete traceback in the logging files, but the logging system for this script is terrible and never logs more than 'Cycle success' or 'Cycle fail'. Ah, the fun of rewriting code written by people who don't know how to code.

Related

Confused by a Python type error

I've been using Python for a little while and have made some improvements, but this is a new error to me. I'm trying to learn social media analysis for my career, and that's why I am trying out this set of code here.
I've debugged one error, but this one, which appears at line 81, has got me stumped: I can't see why the function def get_user_objects(follower_ids): returns None, or what I'd need to change in line with previous advice on other questions here.
Here's the script to that point, for simplicity. All help appreciated.
The error, to repeat, is TypeError: object of type 'NoneType' has no len()
from tweepy import OAuthHandler
from tweepy import API
from collections import Counter
from datetime import datetime, date, time, timedelta
import sys
import json
import os
import io
import re
import time

# Helper functions to load and save intermediate steps
def save_json(variable, filename):
    with io.open(filename, "w", encoding="utf-8") as f:
        f.write(str(json.dumps(variable, indent=4, ensure_ascii=False)))

def load_json(filename):
    ret = None
    if os.path.exists(filename):
        try:
            with io.open(filename, "r", encoding="utf-8") as f:
                ret = json.load(f)
        except:
            pass
    return ret

def try_load_or_process(filename, processor_fn, function_arg):
    load_fn = None
    save_fn = None
    if filename.endswith("json"):
        load_fn = load_json
        save_fn = save_json
    else:
        load_fn = load_bin
        save_fn = save_bin
    if os.path.exists(filename):
        print("Loading " + filename)
        return load_fn(filename)
    else:
        ret = processor_fn(function_arg)
        print("Saving " + filename)
        save_fn(ret, filename)
        return ret

# Some helper functions to convert between different time formats and
# perform date calculations
def twitter_time_to_object(time_string):
    twitter_format = "%a %b %d %H:%M:%S %Y"
    match_expression = "^(.+)\s(\+[0-9][0-9][0-9][0-9])\s([0-9][0-9][0-9][0-9])$"
    match = re.search(match_expression, time_string)
    if match is not None:
        first_bit = match.group(1)
        second_bit = match.group(2)
        last_bit = match.group(3)
        new_string = first_bit + " " + last_bit
        date_object = datetime.strptime(new_string, twitter_format)
        return date_object

def time_object_to_unix(time_object):
    return int(time_object.strftime("%s"))

def twitter_time_to_unix(time_string):
    return time_object_to_unix(twitter_time_to_object(time_string))

def seconds_since_twitter_time(time_string):
    input_time_unix = int(twitter_time_to_unix(time_string))
    current_time_unix = int(get_utc_unix_time())
    return current_time_unix - input_time_unix

def get_utc_unix_time():
    dts = datetime.utcnow()
    return time.mktime(dts.timetuple())

# Get a list of follower ids for the target account
def get_follower_ids(target):
    return auth_api.followers_ids(target)

# Twitter API allows us to batch query 100 accounts at a time
# So we'll create batches of 100 follower ids and gather Twitter User objects for each batch
def get_user_objects(follower_ids):
    batch_len = 100
    num_batches = len(follower_ids)/100
    batches = (follower_ids[i:i+batch_len] for i in range(0, len(follower_ids), batch_len))
    all_data = []
    for batch_count, batch in enumerate(batches):
        sys.stdout.write("\r")
        sys.stdout.flush()
        sys.stdout.write("Fetching batch: " + str(batch_count) + "/" + str(num_batches))
        sys.stdout.flush()
        users_list = auth_api.lookup_users(user_ids=batch)
        users_json = (map(lambda t: t._json, users_list))
        all_data += users_json
    return all_data

# Creates one week length ranges and finds items that fit into those range boundaries
def make_ranges(user_data, num_ranges=20):
    range_max = 604800 * num_ranges
    range_step = range_max/num_ranges
    # We create ranges and labels first and then iterate these when going through the whole list
    # of user data, to speed things up
    ranges = {}
    labels = {}
    for x in range(num_ranges):
        start_range = x * range_step
        end_range = x * range_step + range_step
        label = "%02d" % x + " - " + "%02d" % (x+1) + " weeks"
        labels[label] = []
        ranges[label] = {}
        ranges[label]["start"] = start_range
        ranges[label]["end"] = end_range
    for user in user_data:
        if "created_at" in user:
            account_age = seconds_since_twitter_time(user["created_at"])
            for label, timestamps in ranges.iteritems():
                if account_age > timestamps["start"] and account_age < timestamps["end"]:
                    entry = {}
                    id_str = user["id_str"]
                    entry[id_str] = {}
                    fields = ["screen_name", "name", "created_at", "friends_count", "followers_count", "favourites_count", "statuses_count"]
                    for f in fields:
                        if f in user:
                            entry[id_str][f] = user[f]
                    labels[label].append(entry)
    return labels

if __name__ == "__main__":
    account_list = []
    if (len(sys.argv) > 1):
        account_list = sys.argv[1:]
    if len(account_list) < 1:
        print("No parameters supplied. Exiting.")
        sys.exit(0)
    consumer_key="XXXXXXX"
    consumer_secret="XXXXXX"
    access_token="XXXXXXX"
    access_token_secret="XXXXXXXX"
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    auth_api = API(auth)
    for target in account_list:
        print("Processing target: " + target)
        # Get a list of Twitter ids for followers of target account and save it
        filename = target + "_follower_ids.json"
        follower_ids = try_load_or_process(filename, get_follower_ids, target)
        # Fetch Twitter User objects from each Twitter id found and save the data
        filename = target + "_followers.json"
        user_objects = try_load_or_process(filename, get_user_objects, follower_ids)
        total_objects = len(user_objects)
        # Record a few details about each account that falls between specified age ranges
        ranges = make_ranges(user_objects)
        filename = target + "_ranges.json"
        save_json(ranges, filename)
        # Print a few summaries
        print
        print("\t\tFollower age ranges")
        print("\t\t===================")
        total = 0
        following_counter = Counter()
        for label, entries in sorted(ranges.iteritems()):
            print("\t\t" + str(len(entries)) + " accounts were created within " + label)
            total += len(entries)
            for entry in entries:
                for id_str, values in entry.iteritems():
                    if "friends_count" in values:
                        following_counter[values["friends_count"]] += 1
        print("\t\tTotal: " + str(total) + "/" + str(total_objects))
        print
        print("\t\tMost common friends counts")
        print("\t\t==========================")
        total = 0
        for num, count in following_counter.most_common(20):
            total += count
            print("\t\t" + str(count) + " accounts are following " + str(num) + " accounts")
        print("\t\tTotal: " + str(total) + "/" + str(total_objects))
        print
        print
The immediate problem is in load_json: you assume its return value is a list or dict, or something that can be passed to len. However, it can return None in a number of circumstances:
The file to read from isn't found
There is some error reading from the file
There is a problem decoding the contents of the file
The file contains just the JSON value null.
At no point after you call load_json do you check its return value.
Worse, you catch and ignore any exception that might occur in load_json, causing it to silently return None with no indication that something went wrong.
The function would be better written like this:
def load_json(filename):
    with io.open(filename, "r", encoding="utf-8") as f:
        return json.load(f)
At least now, any errors will raise an uncaught exception, making it more obvious that there was a problem and providing a clue as to what the problem was. The golden rule of exception handling is to only catch the exceptions you can do something about, and if you can't do anything about a caught exception, re-raise it.
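If you would rather keep a forgiving cache layer, another option in the same spirit is to treat a failed load as a cache miss and regenerate the data, instead of handing None to callers that expect a list. A rough sketch built from the question's own functions (the load_bin/save_bin branch is omitted for brevity):
def try_load_or_process(filename, processor_fn, function_arg):
    # Sketch: if the cached JSON cannot be loaded, regenerate and re-save it
    # rather than returning None to callers such as get_user_objects().
    if os.path.exists(filename):
        print("Loading " + filename)
        cached = load_json(filename)
        if cached is not None:
            return cached
        print(filename + " could not be loaded, regenerating it")
    ret = processor_fn(function_arg)
    print("Saving " + filename)
    save_json(ret, filename)
    return ret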
You could also check the resulting value and proceed accordingly:
# Fetch Twitter User objects from each Twitter id found and save the data
filename = target + "_followers.json"
user_objects = try_load_or_process(filename, get_user_objects, follower_ids)
if user_objects is not None:
    total_objects = len(user_objects)
else:
    pass  # handle it otherwise

Data gets mixed up while trying to transfer it to arangodb

I'm trying to transfer ca. 10 GB of JSON data (tweets, in my case) into a collection in ArangoDB. I'm also trying to use joblib for it:
from ArangoConn import ArangoConn
import Userdata as U
import encodings
from joblib import Parallel, delayed
import json
from glob import glob
import time

def progress(total, prog, start, stri = ""):
    if(prog == 0):
        print("")
        prog = 1;
    perc = prog / total
    diff = time.time() - start
    rem = (diff / prog) * (total - prog)
    bar = ""
    for i in range(0, int(perc*20)):
        bar = bar + "|"
    for i in range(int(perc*20), 20):
        bar = bar + " "
    print("\r" + "progress: " + "[" + bar + "] " + str(prog) + " of " + str(total) + ": {0:.1f}% ".format(perc * 100) + "- " + time.strftime("%H:%M:%S", time.gmtime(rem)) + " " + stri, end="")

def processfile(filepath):
    file = open(filepath, encoding='utf-8')
    s = file.read()
    file.close()
    data = json.loads(s)
    Parallel(n_jobs=12, verbose=0, backend="threading")(map(delayed(ArangoConn.createDocFromObject), data))

files = glob(U.path + '/*.json')
i = 1
j = len(files)
starttime = time.time()
for f in files:
    progress(j, i, starttime, f)
    i = i + 1
    processfile(f)
and
from pyArango.connection import Connection
import Userdata as U
import time

class ArangoConn:
    def __init__(self, server, user, pw, db, collectionname):
        self.server = server
        self.user = user
        self.pw = pw
        self.db = db
        self.collectionname = collectionname
        self.connection = None
        self.dbHandle = self.connect()
        if not self.dbHandle.hasCollection(name=self.collectionname):
            coll = self.dbHandle.createCollection(name=collectionname)
        else:
            coll = self.dbHandle.collections[collectionname]
        self.collection = coll

    def db_createDocFromObject(self, obj):
        data = obj.__dict__()
        doc = self.collection.createDocument()
        for key, value in data.items():
            doc[key] = value
        doc._key = str(int(round(time.time() * 1000)))
        doc.save()

    def connect(self):
        self.connection = Connection(arangoURL=self.server + ":8529", username=self.user, password=self.pw)
        if not self.connection.hasDatabase(self.db):
            db = self.connection.createDatabase(name=self.db)
        else:
            db = self.connection.databases.get(self.db)
        return db

    def disconnect(self):
        self.connection.disconnectSession()

    def getAllData(self):
        docs = []
        for doc in self.collection.fetchAll():
            docs.append(self.doc_to_result(doc))
        return docs

    def addData(self, obj):
        self.db_createDocFromObject(obj)

    def search(self, collection, search, prop):
        docs = []
        aql = """FOR q IN """ + collection + """ FILTER q.""" + prop + """ LIKE "%""" + search + """%" RETURN q"""
        results = self.dbHandle.AQLQuery(aql, rawResults=False, batchSize=1)
        for doc in results:
            docs.append(self.doc_to_result(doc))
        return docs

    def doc_to_result(self, arangodoc):
        modstore = arangodoc.getStore()
        modstore["_key"] = arangodoc._key
        return modstore

    def db_createDocFromJson(self, json):
        for d in json:
            doc = self.collection.createDocument()
            for key, value in d.items():
                doc[key] = value
            doc._key = str(int(round(time.time() * 1000)))
            doc.save()

    #staticmethod
    def createDocFromObject(obj):
        c = ArangoConn(U.url, U.user, U.pw, U.db, U.collection)
        data = obj
        doc = c.collection.createDocument()
        for key, value in data.items():
            doc[key] = value
        doc._key = doc["id"]
        doc.save()
        c.connection.disconnectSession()
It kind of works like that. My problem is that the data that lands in the database is somehow mixed up.
As you can see in the screenshot, "id" and "id_str" are not the same - as they should be.
What I have investigated so far:
I thought that at some point the default keys in the database might "collide" because of the threading, so I set the key to the tweet id.
I tried to do it without multiple threads; the threading doesn't seem to be the problem.
I looked at the data I send to the database... everything seems to be fine. But as soon as I communicate with the db, the data mixes up.
My professor thought that maybe something in pyArango isn't threadsafe and messes up the data, but I don't think so, as threading doesn't seem to be the problem.
I have no ideas left where this behavior could come from...
Any ideas?
The screenshot shows the following values:
id : 892886691937214500
id_str : 892886691937214465
It looks like somewhere along the way the value is converted to an IEEE754 double, which cannot safely represent the latter value. So there is potentially some precision loss due to conversion.
A quick example in node.js (JavaScript uses IEEE754 doubles for any number values greater than 0xffffffff) shows that this is likely the cause of the problem:
$ node
> 892886691937214500
892886691937214500
> 892886691937214465
892886691937214500
So the question is where the conversion happens. Can you check whether the Python client program is correctly sending the expected values to ArangoDB, or whether it already sends the converted/truncated values?
In general, any integer number that exceeds 0x7fffffffffffffff will be truncated when stored in ArangoDB, or converted to an IEEE754 double. This can be avoided by storing the number values inside a string, but of course comparing two number strings will produce different results than comparing two numbers (e.g. "10" < "9" vs. 10 > 9).
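The same effect can be reproduced in Python, since a Python float is also an IEEE754 double. A short illustrative check (not part of the original pipeline), using the id_str value from the screenshot:
# Illustrative only: shows how the id from the screenshot loses precision as a double.
original = 892886691937214465        # the id_str value from the screenshot
as_double = float(original)          # what the value becomes as an IEEE754 double

print(int(as_double))                # 892886691937214464 -- no longer the original id
print(int(as_double) == original)    # False

# Storing the value as a string side-steps the conversion entirely,
# at the cost of string comparison semantics, as noted above.
print(str(original))                 # '892886691937214465'
print("10" < "9")                    # True -- lexicographic, unlike 10 > 9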

Parsing JSON in Python

I have the query result below stored in a variable, and I need to fetch the value of 'resource_status'.
I need 'UPDATE_IN_PROGRESS'.
As requested, I'm putting the code here. The variable evntsdata is storing the events list.
try:
    evntsdata = str(hc.events.list(stack_name)[0]).split(" ") # this is the variable that is getting the JSON response (or so)
    #print(evntsdata[715:733])
    #event_handle = evntsdata[715:733]
    if event_handle == 'UPDATE_IN_PROGRESS':
        loopcontinue = True
        while loopcontinue:
            evntsdata = str(hc.events.list(stack_name)[0]).split(" ")
            #event_handle = evntsdata[715:733]
            if (event_handle == 'UPDATE_COMPLETE'):
                loopcontinue = False
                print(str(timestamp()) + " " + "Stack Update is Completed!" + ' - ' + evntsdata[-3] + ' = ' + evntsdata[-1])
            else:
                print(str(timestamp()) + " " + "Stack Update in Progress!" + ' - ' + evntsdata[-3] + ' = ' + evntsdata[-1])
                time.sleep(10)
    else:
        print("No updates to perform")
        exit(0)
except AttributeError as e:
    print(str(timestamp()) + " " + "ERROR: Stack Update Failure")
    raise
print(evntsdata) gives the result below:
['<Event', "{'resource_name':", "'Stackstack1',", "'event_time':", "'2017-05-26T12:10:43',", "'links':", "[{'href':", "'x',", "'rel':", "'self'},", "{'href':", "'x',", "'rel':", "'resource'},", "{'href':", "'x',", "'rel':", "'stack'}],", "'logical_resource_id':", "'Stackstack1',", "'resource_status':", "'UPDATE_IN_PROGRESS',", "'resource_status_reason':", "'Stack", 'UPDATE', "started',", "'physical_resource_id':", "'xxx',", "'id':", "'xxx'}>"]
Do not serialize and parse objects when the data is in front of you. This is inefficient and hard to understand and maintain. The solution is quite trivial:
data = hc.events.list(stack_name)[0].to_dict()
event_handle = data['resource_status']
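For instance, the polling loop from the question could then read the status from the dict instead of slicing a split string. This is only a sketch reusing the question's hc, stack_name, timestamp() and time, with to_dict() as the conversion suggested above:
# Sketch: poll the latest event's resource_status via its dict representation.
loopcontinue = True
while loopcontinue:
    data = hc.events.list(stack_name)[0].to_dict()
    event_handle = data['resource_status']
    if event_handle == 'UPDATE_COMPLETE':
        loopcontinue = False
        print(str(timestamp()) + " Stack Update is Completed!")
    elif event_handle == 'UPDATE_IN_PROGRESS':
        print(str(timestamp()) + " Stack Update in Progress!")
        time.sleep(10)
    else:
        print("No updates to perform")
        loopcontinue = False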
It's not JSON; it's a class instance that you've printed:
class Event(base.Resource):
    def __repr__(self):
        return "<Event %s>" % self._info
Try poking around the source code to get access to the dictionary self._info, then access your fields accordingly.
For example,
event_info = hc.events.list(stack_name)[0]._info
event_handle = event_info['resource_status']
There may be another way, though, like calling to_dict() instead, since the underscore indicates a private variable.

Python, MySQL regression, SQL bug or faulty conditionals?

I have a bucket folder that contains CSV files named in the form yy-mm-dd.CSV, each with several rows of header I can ignore apart from the date at the end of the second row, and then 151 rows of timestamp;power (kW) pairs. Here's a snippet:
sep=;
Version CSV|Tool SunnyBeam11|Linebreaks CR/LF|Delimiter semicolon|Decimalpoint point|Precision 3|Language en-UK|TZO=0|DST|2012.06.21
;SN: removed
;SB removed
;2120138796
Time;Power
HH:mm;kW
00:10;0.000
00:20;0.000
00:30;0.000
00:40;0.000
00:50;0.000
01:00;0.000
01:10;0.000
01:20;0.000
01:30;0.000
01:40;0.000
01:50;0.000
02:00;0.000
02:10;0.000
02:20;0.000
02:30;0.000
02:40;0.000
02:50;0.000
03:00;0.000
03:10;0.000
03:20;0.000
03:30;0.000
03:40;0.000
03:50;0.000
04:00;0.000
04:10;0.000
04:20;0.000
04:30;0.000
04:40;0.000
04:50;0.006
05:00;0.024
05:10;0.006
05:20;0.000
05:30;0.030
05:40;0.036
05:50;0.042
06:00;0.042
06:10;0.042
06:20;0.048
06:30;0.060
06:40;0.114
06:50;0.132
07:00;0.150
I parse the bucket folder for these files, checking that they have this filename format (there are other files I don't want to parse), and I grab the date from row two of each file and store it. I connect to the database and then work down the remaining lines, concatenating the stored date with the timestamp on each line after row 9 (or thereabouts). I also grab the second value on each line (power, in kW). The intention is to insert the concatenated date-time value and associated power value into the connected MySQL database. When the last line is read, the file is moved to a subfolder called 'parsed'.
All of this proceeds as expected, but every row read goes through the except branch of the try/except (Line 107) that prints 'cannot append to Db'. I've checked that the stored database credentials work by logging in to MySQL (actually MariaDB on openSUSE LEAP 4.2), and I've printed the connection variable; both lead me to believe that I am actually connected properly for each file.
I would snip out parts of my Python script to make it shorter, but I'm not a particularly advanced Python coder and I don't want to risk missing the key part:
#!/usr/bin/python
from os import listdir
from datetime import datetime
import MySQLdb
import shutil
import syslog
#from sys import argv

def is_dated_csv(filename):
    """
    Return True if filename matches format YY-MM-DD.csv, otherwise False.
    """
    date_format = '%y-%m-%d.csv'
    try:
        date = datetime.strptime(filename, date_format)
        return True
    except ValueError:
        # filename did not match pattern
        syslog.syslog('SunnyData file ' + filename + ' did NOT match')
        #print filename + ' did NOT match'
        pass
    #'return' terminates a function
    return False

def parse_for_date(filename):
    """
    Read file for the date - from line 2 field 10
    """
    currentFile = open(filename,'r')
    l1 = currentFile.readline() #ignore first line read
    date_line = currentFile.readline() #read second line
    dateLineArray = date_line.split("|")
    day_in_question = dateLineArray[-1] #save the last element (date)
    currentFile.close()
    return day_in_question

def normalise_date_to_UTF(day_in_question):
    """
    Rather wierdly, some days use YYYY.MM.DD format & others use DD/MM/YYYY
    This function normalises either to UTC with a blank time (midnight)
    """
    if '.' in day_in_question: #it's YYYY.MM.DD
        dateArray = day_in_question.split(".")
        dt = (dateArray[0] + dateArray[1] + dateArray[2].rstrip() + '000000')
    elif '/' in day_in_question: #it's DD/MM/YYYY
        dateArray = day_in_question.split("/")
        dt = (dateArray[2].rstrip() + dateArray[1] + dateArray[0] + '000000')
    theDate = datetime.strptime(dt,'%Y%m%d%H%M%S')
    return theDate #A datetime object

def parse_power_values(filename, theDate):
    currentFile = open(filename,'r')
    for i, line in enumerate(currentFile):
        if i <= 7:
            doingSomething = True
            print 'header' + str(i) + '/ ' + line.rstrip()
        elif ((i > 7) and (i <= 151)):
            lineParts = line.split(';')
            theTime = lineParts[0].split(':')
            theHour = theTime[0]
            theMin = theTime[1]
            timestamp = theDate.replace(hour=int(theHour),minute=int(theMin))
            power = lineParts[1].rstrip()
            if power == '-.---':
                power = 0.000
            if (float(power) > 0):
                print str(i) + '/ ' + str(timestamp) + ' power = ' + power + 'kWh'
                append_to_database(timestamp,power)
            else:
                print str(i) + '/ '
        elif i > 151:
            print str(timestamp) + ' DONE!'
            print '----------------------'
            break
    currentFile.close()

def append_to_database(timestampval,powerval):
    host="localhost", # host
    user="removed", # username
    #passwd="******"
    passwd="removed"
    database_name = 'SunnyData'
    table_name = 'DTP'
    timestamp_column = 'DT'
    power_column = 'PWR'
    #sqlInsert = ("INSERT INTO %s (%s,%s) VALUES('%s','%s')" % (table_name, timestamp_column, power_column, timestampval.strftime('%Y-%m-%d %H:%M:%S'), powerval) )
    #sqlCheck = ("SELECT TOP 1 %s.%s FROM %s WHERE %s.%s = %s;" % (table_name, timestamp_column, table_name, table_name, timestamp_column, timestampval.strftime('%Y-%m-%d %H:%M:%S')) )
    sqlInsert = ("INSERT INTO %s (%s,%s) VALUES('%s','%s')", (table_name, timestamp_column, power_column, timestampval.strftime('%Y-%m-%d %H:%M:%S'), powerval) )
    sqlCheck = ("SELECT TOP 1 %s.%s FROM %s WHERE %s.%s = %s;", (table_name, timestamp_column, table_name, table_name, timestamp_column, timestampval.strftime('%Y-%m-%d %H:%M:%S')) )
    cur = SD.cursor()
    try:
        #cur.execute(sqlCheck)
        # Aim here is to see if the datetime for the file has an existing entry in the database_name
        # If it does, do nothing, otherwise add the values to the datbase
        cur.execute(sqlCheck)
        if cur.fetchone() == "None":
            cur.execute(sqlInsert)
            print ""
            SD.commit()
    except:
        print 'DB append failed!'
        syslog.syslog('SunnyData DB append failed')
        SD.rollback()

# Main start of program
path = '/home/greg/currentGenerated/SBEAM/'
destination = path + '/parsed'
syslog.syslog('parsing SunnyData CSVs started')
for filename in listdir(path):
    print filename
    if is_dated_csv(filename):
        #connect and disconnect once per CSV file - wasteful to reconnect for every line in def append_to_database(...)
        SD = MySQLdb.connect(host="localhost", user="root",passwd="removed", db = 'SunnyData')
        print SD
        print filename + ' matched'
        day_in_question = parse_for_date(filename)
        print 'the date is ' + day_in_question
        theDate = normalise_date_to_UTF(day_in_question)
        parse_power_values(filename, theDate)
        SD.close()
        shutil.move(path + '/' + filename, destination)
        syslog.syslog('SunnyData file' + path + '/' + filename + 'parsed & moved to ' + destination)
It used to work, but it's been a long time and many updates since I last checked it. I worry that a regression may have changed something underneath my code; I'm just not sure how to work it all out.
Apologies that this isn't a very crisp and specific question, but if you can help me sort it out, it may still serve as a good example for others.
Thanks
Greg
There is no SELECT TOP ... syntax in MySQL/MariaDB, so your script must be failing upon trying to execute sqlCheck.
It should be SELECT %s.%s FROM %s WHERE %s.%s = %s LIMIT 1 instead.
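As a rough sketch (illustrative, not a tested drop-in), the check/insert pair could use the LIMIT 1 form with MySQLdb parameter binding, and catch a specific exception so the real error is printed instead of swallowed. Table and column names are taken from the question, and SD is the connection object created in the question's main loop:
def append_to_database(timestampval, powerval):
    # Sketch: LIMIT 1 instead of SELECT TOP, values bound by the driver,
    # and the actual database error reported rather than hidden.
    sqlCheck = "SELECT DT FROM DTP WHERE DT = %s LIMIT 1"
    sqlInsert = "INSERT INTO DTP (DT, PWR) VALUES (%s, %s)"
    ts = timestampval.strftime('%Y-%m-%d %H:%M:%S')
    cur = SD.cursor()
    try:
        cur.execute(sqlCheck, (ts,))
        if cur.fetchone() is None:        # fetchone() returns None, not the string "None"
            cur.execute(sqlInsert, (ts, powerval))
            SD.commit()
    except MySQLdb.Error as e:
        print 'DB append failed: ' + str(e)
        syslog.syslog('SunnyData DB append failed: ' + str(e))
        SD.rollback()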

How to escape table names in SqlAlchemy

I'm working on a SQLAlchemy dialect for Apache Drill and I've run into an issue that I can't quite seem to figure out.
The basic problem is that SQLAlchemy is generating a query like the one below:
SELECT `field1`, `field2`
FROM dfs.test.data.csv LIMIT 100
which fails because data.csv needs backticks around it as shown below:
SELECT `field1`, `field2`
FROM dfs.test.`data.csv` LIMIT 100
I've defined the various visit_() functions in the dialect's compiler but these seem to have no effect.
This took some time to figure out, and I thought I'd post the result so that if anyone else runs into this issue, they'll have a point of reference as to how to solve it.
Here is the final working code:
https://github.com/JohnOmernik/sqlalchemy-drill/blob/master/sqlalchemy_drill/base.py
Here is what ultimately solved the issue:
def __init__(self, dialect):
    super(DrillIdentifierPreparer, self).__init__(dialect, initial_quote='`', final_quote='`')

def format_drill_table(self, schema, isFile=True):
    formatted_schema = ""
    num_dots = schema.count(".")
    schema = schema.replace('`', '')
    # For a file, the last section will be the file extension
    schema_parts = schema.split('.')
    if isFile and num_dots == 3:
        # Case for File + Workspace
        plugin = schema_parts[0]
        workspace = schema_parts[1]
        table = schema_parts[2] + "." + schema_parts[3]
        formatted_schema = plugin + ".`" + workspace + "`.`" + table + "`"
    elif isFile and num_dots == 2:
        # Case for file and no workspace
        plugin = schema_parts[0]
        formatted_schema = plugin + "." + schema_parts[1] + ".`" + schema_parts[2] + "`"
    else:
        # Case for non-file plugins or incomplete schema parts
        for part in schema_parts:
            quoted_part = "`" + part + "`"
            if len(formatted_schema) > 0:
                formatted_schema += "." + quoted_part
            else:
                formatted_schema = quoted_part
    return formatted_schema
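For example, tracing format_drill_table on the schema string from the question (here preparer stands for an instance of the dialect's identifier preparer subclass, so the variable name is an assumption):
# Hypothetical usage, following the branches of format_drill_table above.
preparer.format_drill_table("dfs.test.data.csv")           # -> dfs.`test`.`data.csv`  (file + workspace)
preparer.format_drill_table("cp.employee", isFile=False)   # -> `cp`.`employee`        (non-file plugin)
This produces the backtick-quoted identifiers that Drill expects in the FROM clause.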
