SnapLogic Python Read and Execute SQL File

I have a simple SQL file that I'd like to read and execute using a Python Script Snap in SnapLogic. I created an expression library file to reference the Redshift account and have included it as a parameter in the pipeline.
I have the code below from another post. Is there a way to reference the pipeline parameter to connect to the Redshift database, read the uploaded SQL file and execute the commands?
fd = open('shared/PythonExecuteTest.sql', 'r')
sqlFile = fd.read()
fd.close()
sqlCommands = sqlFile.split(';')
sqlCommands = sqlFile.split(';')
for command in sqlCommands:
    try:
        c.execute(command)
    except OperationalError, msg:
        print "Command skipped: ", msg

You can access pipeline parameters in scripts using the $_ prefix.
Say you have a pipeline parameter executionId; to access it in the script, you can use $_executionId.
Here is a test pipeline, configured with an executionId pipeline parameter and some test data. Following is the script:
# Import the interface required by the Script snap.
from com.snaplogic.scripting.language import ScriptHook
import java.util

class TransformScript(ScriptHook):
    def __init__(self, input, output, error, log):
        self.input = input
        self.output = output
        self.error = error
        self.log = log

    # The "execute()" method is called once when the pipeline is started
    # and allowed to process its inputs or just send data to its outputs.
    def execute(self):
        self.log.info("Executing Transform script")
        while self.input.hasNext():
            try:
                # Read the next document, wrap it in a map and write out the wrapper
                in_doc = self.input.next()
                wrapper = java.util.HashMap()
                wrapper['output'] = in_doc
                wrapper['output']['executionId'] = $_executionId
                self.output.write(in_doc, wrapper)
            except Exception as e:
                errWrapper = {
                    'errMsg' : str(e.args)
                }
                self.log.error("Error in python script")
                self.error.write(errWrapper)
        self.log.info("Finished executing the Transform script")

# The Script Snap will look for a ScriptHook object in the "hook"
# variable. The snap will then call the hook's "execute" method.
hook = TransformScript(input, output, error, log)
In the output, you can see that the executionId was read from the pipeline parameters and added to each document.
Note: Accessing pipeline parameters from scripts is a valid scenario, but accessing other external systems from a script is complicated (you would need to load the required libraries) and not recommended. Use the snaps provided by SnapLogic to access external systems. Also, if you want to use other libraries inside scripts, try sticking to JavaScript instead of Python, because there are a lot of open-source CDNs you can pull libraries from in your scripts.
Also, you can't access a configured expression library directly from the script. If you need some logic in the script, keep it in the script rather than somewhere else. And there is no point in accessing account names in the script (or in Mappers): even if you know the account name, you can't use the credentials/configurations stored in that account directly; that is handled by SnapLogic. Use the provided snaps and Mappers as much as possible.
Update #1
You can't access the account directly. Accounts are managed and used internally by the snaps; you can only create and select accounts through the Accounts tab of the relevant snap.
Avoid using the Script Snap as much as possible, especially if you can do the same thing using normal snaps.
Update #2
The simplest solution to this requirement would be as follows.
1. Read the file using a File Reader Snap.
2. Split the content on ;.
3. Execute each SQL command using the Generic JDBC Execute Snap.
A rough Python sketch of the splitting step follows.
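For illustration only, the splitting step amounts to something like this (a minimal Python sketch, not an actual snap configuration; note that a naive split on ; breaks on semicolons inside string literals):

# Minimal sketch: split the uploaded SQL file into individual statements.
with open('shared/PythonExecuteTest.sql', 'r') as fd:
    sql_text = fd.read()

# Naive split on ';' -- fine for simple scripts, but semicolons inside
# string literals or procedure bodies would need a real SQL parser.
statements = [s.strip() for s in sql_text.split(';') if s.strip()]
for stmt in statements:
    print(stmt)  # each statement would feed one Generic JDBC Execute call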

Related

Import Excel file into MSSQL via Python, using an SQL Agent job

Task specs:
Import Excel file(s) into MSSQL database(s) using Python, but in a parametrized manner, and using SQL Server Agent job(s).
With the added requirement to set parameter values and/or run the job steps from SQL (query or SP).
And without using Access Database Engine(s) and/or any code that makes use of such drivers (in any wrapping).
First. Let's get some preparatory stuff out of the way.
We will need to set some PowerShell settings.
Run Windows PowerShell as Administrator and do:
Set-ExecutionPolicy -ExecutionPolicy Unrestricted
Second. Some assumptions for reasons of clarity.
And those are:
1a. You have at least one instance of SQL Server 2017 or later (Developer/Enterprise/Standard edition) installed and running on your machine.
1b. You have not bootstrapped the installation of this SQL instance so as to exclude Integration Services (SSIS).
1c. SQL Server Agent is running, bound to this SQL instance.
1d. You have SSMS installed.
2a. There is at least one database attached to this instance (if not, create one; please refrain from using in-memory filegroups for this exercise, I have not tested on those).
2b. There are no database-level DML triggers that log all data changes in a designated table.
3. There is no active Server Audit Specification for this database logging everything we do.
4. Replication is not enabled (I mean the proper MSSQL replication feature, not scripts by 3rd-party apps).
For 2b and 3 it's just because I have not tested with those on, but for number 4 it definitely won't work with that on.
5. You are Windows-authenticated into the chosen SQL instance, and your instance login and db mappings and privileges are sufficient for at least table creation and basic stuff.
Third.
We are going to need some kind of Python script to do this, right?
OK, let's make one.
import pandas as pd
import sqlalchemy as sa
import urllib.parse
import sys
import warnings
import os
import re
import time

# COMMAND LINE PARAMETERS
server = sys.argv[1]
database = sys.argv[2]
ExcelFileHolder = sys.argv[3]
SQLTableName = sys.argv[4]
# END OF COMMAND LINE PARAMETERS

excel_sheet_number_left_to_right = 0
warnings.filterwarnings('ignore')

driver = "SQL Server Native Client 11.0"
params = "DRIVER={%s};SERVER=%s;DATABASE=%s;Trusted_Connection=yes;QuotedID=Yes;" % (driver, server, database) #added the explicit "QuotedID=Yes;" to ensure no issues with column names
params = urllib.parse.quote_plus(params) #urllib.parse.quote_plus for Python 3
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s?charset=utf8" % params) #charset is cool to have here
conn = engine.connect()

log = open("ExcelPythonToSQL.log", "a") #simple log file; needed by execute_sql_trans below

def execute_sql_trans(sql_string, log_entry):
    with conn.begin() as trans:
        result = conn.execute(sql_string)
        if len(log_entry) >= 1:
            log.write(log_entry + "\n")
    return result

excelfilesCursor = {}

def process_excel_file(excelfile, excel_sheet_name, tableName, withPyIndexOrSQLIndex, orderByCandidateFields):
    withPyIndexOrSQLIndex = 0
    excelfilesCursor.update({tableName: withPyIndexOrSQLIndex})
    df = pd.read_excel(open(excelfile, 'rb'), sheet_name=excel_sheet_name)
    now = time.time()
    mlsec = repr(now).split('.')[1][:3]
    log_string = "Reading file \"" + excelfile + "\" to memory: " + str(time.strftime("%Y-%m-%d %H:%M:%S.{} %Z".format(mlsec), time.localtime(now))) + "\n"
    print(log_string)
    df.to_sql(tableName, engine, if_exists='replace', index_label='index.py')
    now = time.time()
    mlsec = repr(now).split('.')[1][:3]
    log_string = "Writing file \"" + excelfile + "\", sheet " + str(excel_sheet_name) + " to SQL instance " + server + ", into [" + database + "].[dbo].[" + tableName + "]: " + str(time.strftime("%Y-%m-%d %H:%M:%S.{} %Z".format(mlsec), time.localtime(now))) + "\n"
    print(log_string)

def convert_datetimes_to_dates(tableNameParam):
    sql_string = "exec [convert_datetimes_to_dates] '" + tableNameParam + "';"
    execute_sql_trans(sql_string, "")

process_excel_file(ExcelFileHolder, excel_sheet_number_left_to_right, SQLTableName, 0, None)
sys.exit(0)
OK, you may or may not notice that my script contains some extra defs; I sometimes use them for convenience, and you may as well ignore them.
Save the Python script somewhere nice, say C:\PythonWorkspace\ExcelPythonToSQL.py.
Also, needless to mention, you will need some Python modules in your venv; pip install the ones you don't already have.
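Before wiring up the Agent job, a quick smoke test from Python could look like this (a sketch only; the venv path, instance, database, file, and table names are placeholders to adapt):

import subprocess

# All paths and names below are placeholders, not values you must keep.
subprocess.run([
    r"C:\ProgramData\Miniconda3\envs\YOURVENVNAMEHERE\python.exe",
    r"C:\PythonWorkspace\ExcelPythonToSQL.py",
    r".\SomeNamedSQLInstance",         # server
    "SomeDatabase",                    # database
    r"C:\Temp\ExampleExcelFile.xlsx",  # Excel file
    "somenewtable",                    # target table
], check=True)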
Fourth.
Connect to your db, SSMS, etc. and create a new Agent job.
Let's call it "ExcelPythonToSQL".
New step, let's call it "PowerShell parametrized script".
Set the Type to PowerShell.
And place this code inside it:
$pyFile="C:\PythonWorkspace\ExcelPythonToSQL.py"
$SQLInstance="SomeMachineName\SomeNamedSQLInstance"
#or . or just the computername or localhost if your SQL instance is a default instance i.e. not a named one.
$dbName="SomeDatabase"
$ExcelFileFullPath="C:\Temp\ExampleExcelFile.xlsx"
$targetTableName="somenewtable"
C:\ProgramData\Miniconda3\envs\YOURVENVNAMEHERE\python $pyFile $SQLInstance $dbName $ExcelFileFullPath $targetTableName
Save the step and the job.
Now let's wrap it in something easier to handle. Remember, this job and step are not like an SSIS step, where you could potentially alter the parameter values in its configuration tab. You don't want to open the properties of the job and the step every time just to specify a different Excel file or target table.
So.
Ah, also, do me a solid and use this little trick: make a small alteration to the step (anything), and then, instead of clicking OK, choose Script to New Query Window. That way we capture the GUID of the job without having to query for it.
So now.
Create a SP like so:
use [YourDatabase];
GO
create proc [ExcelPythonToSQL_amend_job_step_params](
    @pyFile nvarchar(max),
    @SQLInstance nvarchar(max),
    @dbName nvarchar(max),
    @ExcelFileFullPath nvarchar(max),
    @targetTableName nvarchar(max)='somenewtable'
)
as
begin
    declare @sql nvarchar(max);
    set @sql = '
exec msdb.dbo.sp_update_jobstep @job_id=N''7f6ff378-56cd-4a8d-ba40-e9057439a5bc'', @step_id=1,
@command=N''
$pyFile="'+@pyFile+'"
$SQLInstance="'+@SQLInstance+'"
$dbName="'+@dbName+'"
$ExcelFileFullPath="'+@ExcelFileFullPath+'"
$targetTableName="'+@targetTableName+'"
C:\ProgramData\Miniconda3\envs\YOURVENVGOESHERE\python $pyFile $SQLInstance $dbName $ExcelFileFullPath $targetTableName''
';
    --print @sql;
    exec sp_executesql @sql;
end
But inside it you must replace two things. One, the uniqueidentifier of the Agent job, which you captured earlier with the Script to New Query Window trick. Two, the name of your Python venv, replacing the word YOURVENVGOESHERE in the code.
Cool.
Now, with a simple script we can play-test it.
Put something like this in a new query window:
use [YourDatabase];
GO
--to set parameters
exec [ExcelPythonToSQL_amend_job_step_params]
    @pyFile='C:\PythonWorkspace\ExcelPythonToSQL.py',
    @SQLInstance='.',
    @dbName='YourDatabase',
    @ExcelFileFullPath='C:\Temp\ExampleExcelFile.xlsx',
    @targetTableName='somenewtable';
--to execute the job
exec msdb.dbo.sp_start_job N'ExcelPythonToSQL', @step_name = N'PowerShell parametrized script';
--let's test that the table is there and drop it.
if object_id('YourDatabase..somenewtable') is not null
begin
    select 'Table was here!' [test: table exists?];
    drop table [somenewtable];
end
else select 'NADA!' [test: table exists?];
You can run the set-parameters part, then the execution; be careful to wait a few seconds afterwards, because calling sp_start_job like this is asynchronous. Then run the test script to make sure the data went in, and to clean up.
That's it.
Obviously lots of variations are possible.
For instance, in the job step we could instead call a batch file, or call a PowerShell .ps1 file and keep the parameters in there; there are lots and lots of other ways of doing it. I merely described one in this post.

get_process_id() from robot Process() library. "No active process"

I am trying to use the Process() robot framework library to launch and track processes.
https://robot-framework.readthedocs.io/en/v3.0.3/_modules/robot/libraries/Process.html
After I launch my process I am unable to use the get_process_id() method. I wrote a simple example using notepad.exe below
path = "C:\\WINDOWS\\system32"
Process().start_process('notepad.exe',shell=False, cwd=path)
var = Process().get_process_id()
BuiltIn().log_to_console(var)
This gives me the error of "No active process."
Alternatively, using handles as explained in the documentation
path = "C:\\WINDOWS\\system32"
handle = Process().start_process('notepad.exe',shell=False,cwd=path)
var = Process().get_process_id(handle)
BuiltIn().log_to_console(var)
I get the error "Non-existing index or alias '1'."
When you do Process().get_process_id(), you are creating a new instance of the library. This instance doesn't know about any processes started by the previous instance of the library.
You need to get a single instance of the library, and use that consistently.
processLib = Process()
processLib.start_process(...)
var = processLib.get_process_id()
The best thing to do is to try to get a reference to an existing Process library instance using BuiltIn().get_library_instance, and only create a new one if that fails.
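A minimal sketch of that pattern (assuming Robot Framework's BuiltIn and Process libraries; the fallback to a fresh Process() only makes sense outside a running Robot context):

from robot.libraries.BuiltIn import BuiltIn, RobotNotRunningError
from robot.libraries.Process import Process

def get_process_library():
    # Reuse the Process instance Robot Framework already manages, so we
    # see processes started by earlier keywords; fall back to a fresh
    # instance only when no Robot context or library instance exists.
    try:
        return BuiltIn().get_library_instance('Process')
    except (RuntimeError, RobotNotRunningError):
        return Process()

processLib = get_process_library()
handle = processLib.start_process('notepad.exe', shell=False, cwd="C:\\WINDOWS\\system32")
print(processLib.get_process_id(handle))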

Creating an extension for SPSS with Python

So I have discovered the Python extension mechanism for SPSS, and everything works fine: I have created some scripts, included them in the extensions folder, and they work. However, now I have created a couple of scripts that require arguments; I thought I could just follow the same method, but I guess not.
def Run(args):
    import spss
    def testing_p(variables):
        all_variables = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
        variable_nr = [all_variables.index(i) for i in variables]
        print all_variables
        print variable_nr
With the following .xml-file:
<Command xmlns="http://xml.spss.com/extension" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Name="testing_p" Language="Python">
</Command>
However, this keeps throwing the following error when calling testing_p(['my_var', 'my_var2']):
Warnings
This command should specify a valid subcommand at the beginning.
Execution of this command stops.
I cannot wrap my head around this, because everything works fine when it is not put in the extensions folder and I only do:
BEGIN PROGRAM.
import spss
def testing_p(variables):
    all_variables = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
    variable_nr = [all_variables.index(i) for i in variables]
    print all_variables
    print variable_nr
END PROGRAM.
For an extension, which can be written in Python, R, or Java, you need to create a syntax specification containing the command name, any subcommands, and the arguments and argument types you want. Here is a picture of the start of one (SPSSINC_TURF, which is installed with Statistics).
This will guide the Statistics parser in checking the user input. It also then calls the Run function with a complicated structure containing the user input. You can use the functions in the extensions module to map that to your Python variables and do further validation. Here is a picture of the start of the Run function for SPSSINC TURF.
Finally, if the syntax is valid, your Run function calls the worker function to do something useful, mapping all the parameters to the specified arguments by calling
processcmd(oobj, args, superturf, vardict=spssaux.VariableDict())
which was imported from extensions.py.
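To make this concrete, here is a rough sketch of what a Run function for the testing_p command might look like (illustrative only: the keyword name VARIABLES and the Template arguments are assumptions that would have to match your XML specification, not code from SPSSINC_TURF):

import spss
from extension import Template, Syntax, processcmd

def testing_p(variables):
    all_variables = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
    variable_nr = [all_variables.index(v) for v in variables]
    print all_variables
    print variable_nr

def Run(args):
    # Describe the expected syntax: a VARIABLES keyword holding a list
    # of existing variable names, mapped to the "variables" argument.
    oobj = Syntax([
        Template("VARIABLES", subc="", ktype="existingvarlist",
                 var="variables", islist=True)])
    # The parser hands in the arguments keyed by command name; unwrap first.
    args = args[args.keys()[0]]
    processcmd(oobj, args, testing_p)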
Look at the doc for extensions in the help system, and look at some of the extensions installed with Statistics for examples.
Finally, here is a slide from one of my presentations summarizing the flow from user input to results.

How to unit test program interacting with block devices

I have a program that interacts with and changes block devices (/dev/sda and such) on linux. I'm using various external commands (mostly commands from the fdisk and GNU fdisk packages) to control the devices. I have made a class that serves as the interface for most of the basic actions with block devices (for information like: What size is it? Where is it mounted? etc.)
Here is one such method querying the size of a partition:
def get_drive_size(device):
    """Returns the maximum size of the drive, in sectors.
    :device the device identifier (/dev/sda and such)"""
    query_proc = subprocess.Popen(["blockdev", "--getsz", device], stdout=subprocess.PIPE)
    #blockdev returns the number of 512B blocks in a drive
    output, error = query_proc.communicate()
    exit_code = query_proc.returncode
    if exit_code != 0:
        raise Exception("Non-zero exit code", str(error, "utf-8")) #I have custom exceptions, this is slight pseudo-code
    return int(output) #should always be valid
So this method accepts a block device path, and returns an integer. The tests will run as root, since this entire program will end up having to run as root anyway.
Should I try and test code such as these methods? If so, how? I could try and create and mount image files for each test, but this seems like a lot of overhead, and is probably error-prone itself. It expects block devices, so I cannot operate directly on image files in the file system.
I could try mocking, as some answers suggest, but this feels inadequate: if I mock the Popen object, it seems like I start to test the method's implementation rather than its output. Is this a correct assessment of proper unit-testing methodology in this case?
I am using python3 for this project, and I have not yet chosen a unit-testing framework. In the absence of other reasons, I will probably just use the default unittest framework included in Python.
You should look into the mock module (in Python 3 it is part of the standard library as unittest.mock).
It enables you to run tests without needing to depend on any external resources, while giving you control over how the mocks interact with your code.
I would start from the docs on Voidspace.
Here's an example:
import unittest2 as unittest
import mock

class GetDriveSizeTestSuite(unittest.TestCase):
    @mock.patch('path/to/original/file.subprocess.Popen')
    def test_a_scenario_with_mock_subprocess(self, mock_popen):
        # communicate() returns (stdout, stderr); a returncode of 0 means success
        mock_popen.return_value.communicate.return_value = ('1024', '')
        mock_popen.return_value.returncode = 0
        self.assertEqual(1024, get_drive_size('some device'))

Recording the Python script execution method

I have composed an ArcPy script which is run via the Windows scheduler. The same script is loaded into a script tool so a user can run the process manually. I've used GetParameterAsText, with or's and not's, to hard-wire the standard variables if they are not specified.
ReportFolder = arcpy.GetParameterAsText(0)
if ReportFolder == '#' or not ReportFolder:
    ReportFolder = "C:\\Data\\GIS"
The process runs and, while doing so, writes to a text file log, for example:
txtFile.write("= For ArcGIS 10.3.1: Date: "+str(timed)),txtFile.write ('\n')
I'd like to record which method was used to execute the script: was it via the Windows scheduler, by the script tool via ArcGIS, or by a Python client like PyScripter?
Is anyone aware of some form of OS environment thingy that can be called from Python?
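One heuristic starting point (a sketch, not an official ArcGIS API; the executable-name checks below are assumptions to verify in your environment, relying on the fact that a script tool runs in-process, so sys.executable then points at the host application rather than python.exe):

import os
import sys

def detect_launcher():
    # Heuristic: inspect the executable hosting this interpreter.
    exe = os.path.basename(sys.executable).lower()
    if exe.startswith(("arcmap", "arcgispro", "arccatalog")):
        return "ArcGIS script tool"
    # Plain python.exe could be the scheduler, PyScripter's remote engine,
    # or a console; an explicit marker argument in the scheduled task's
    # command line is the most reliable discriminator.
    if "--scheduled" in sys.argv:
        return "Windows scheduler"
    return "other python client"

log_line = "Executed via: " + detect_launcher()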
