I am able to create, drop, modify tables using pyspark and hivecontext. I load a list with commands I want to send, in string format, and pass them into this function:
def hiveCommands(commands, database):
conf = SparkConf().setAppName(database + 'project').setMaster('local')
sc = SparkContext(conf=conf)
df = HiveContext(sc)
f = df.sql('use ' + database)
for command in commands:
f = df.sql(command)
f.collect()
It works fine for maintenance, but I'm trying to dip my toes into analysis, and I don't see any output when I try to send a command like "describe table."
I just that it takes in the command and executes it without any errors, but I don't see what the actual output of the query is. I may need to mess with my .profile or .bashrc, not really sure. Something of a Linux newby. Any help would be appreciated.
Call show method to see results:
for command in commands:
df.sql(command).show()
Related
Task specs:
Import Excel file(s) into MSSQL database(s) using Python, but in a parametrized manner, and using SQL Server Agent job(s).
With the added requirement to set parameter values and/or run the job steps from SQL (query or SP).
And without using Access Database Engine(s) and/or any code that makes use of such drivers (in any wrapping).
First. Let's get some preparatory stuff out of the way.
We will need to set some PowerShell settings.
Run windows PowerShell as Administrator and do:
Set-ExecutionPolicy -ExecutionPolicy Unrestricted
Second. Some assumptions for reasons of clarity.
And those are:
1a. You have at least one instance of SQL2017 or later (Developer / Enterprise / Standard edition) installed and running on your machine.
1b. You have not bootstrapped the installation of this SQL instance so as to exclude Integration Services (SSIS).
1c. There exists SQL Server Agent running, bound to this SQL instance.
1d. You have some SSMS installed.
2a. There is at least one database attached to this instance (if not create one – please refrain from using in-memory filegroups for this exercise, I have not tested on those).
2b. There are no database level DML triggers that log all data changes in a designated table.
3. There is no active Server Audit Specification for this database logging everything we do.
4. Replication is not enabled (I mean the proper MSSQL replication feature not like scripts by 3rd party apps).
For 2b and 3 it's just cause I have not tested this with those on, but for number 4 it defo won't work with that on.
5. You are windows authenticated into the chosen SQL instance and your instance login and db mappings and privileges are sufficient for at least table creation and basic stuff.
Third.
We are going to need some kind of Python script to do this right?
Ok let's make one.
import pandas as pd
import sqlalchemy as sa
import urllib
import sys
import warnings
import os
import re
import time
#COMMAND LINE PARAMETERS
server = sys.argv[1]
database = sys.argv[2]
ExcelFileHolder = sys.argv[3]
SQLTableName = sys.argv[4]
#END OF COMMAND LINE PARAMETERS
excel_sheet_number_left_to_right = 0
warnings.filterwarnings('ignore')
driver = "SQL Server Native Client 11.0"
params = "DRIVER={%s};SERVER=%s;DATABASE=%s;Trusted_Connection=yes;QuotedID=Yes;" % (driver, server, database) #added the explicit "QuotedID=Yes;" to ensure no issues with column names
params = urllib.parse.quote_plus(params) #urllib.parse.quote_plus for Python 3
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s?charset=utf8" % params) #charset is cool to have here
conn = engine.connect()
def execute_sql_trans(sql_string, log_entry):
with conn.begin() as trans:
result = conn.execute(sql_string)
if len(log_entry) >= 1:
log.write(log_entry + "\n")
return result
excelfilesCursor = {}
def process_excel_file(excelfile, excel_sheet_name, tableName, withPyIndexOrSQLIndex, orderByCandidateFields):
withPyIndexOrSQLIndex = 0
excelfilesCursor.update({tableName: withPyIndexOrSQLIndex})
df = pd.read_excel(open(excelfile,'rb'), sheet_name=excel_sheet_name)
now = time.time()
mlsec = repr(now).split('.')[1][:3]
log_string = "Reading file \"" + excelfile + "\" to memory: " + str(time.strftime("%Y-%m-%d %H:%M:%S.{} %Z".format(mlsec), time.localtime(now))) + "\n"
print(log_string)
df.to_sql(tableName, engine, if_exists='replace', index_label='index.py')
now = time.time()
mlsec = repr(now).split('.')[1][:3]
log_string = "Writing file \"" + excelfile + "\", sheet " +str(excel_sheet_name)+ " to SQL instance " +server+ ", into ["+database+"].[dbo].["+tableName+"]: " + str(time.strftime("%Y-%m-%d %H:%M:%S.{} %Z".format(mlsec), time.localtime(now))) + "\n"
print(log_string)
def convert_datetimes_to_dates(tableNameParam):
sql_string = "exec [convert_datetimes_to_dates] '"+tableNameParam+"';"
execute_sql_trans(sql_string, "")
process_excel_file(ExcelFileHolder, excel_sheet_number_left_to_right, SQLTableName, 0, None)
sys.exit(0)
Ok you may or may not notice that my script contains some extra defs, I sometimes use them for convenience you may as well ignore them.
Save the python script somewhere nice say C:\PythonWorkspace\ExcelPythonToSQL.py
Also, needless to mention that you will need some py modules in your venv. The ones you don't already have you need to pip install them obviously.
Fourth.
Connect to your db, SSMS, etc. and create a new Agent job.
Let's call it "ExcelPythonToSQL".
New step, let's call it "PowerShell parametrized script".
Set the Type to PowerShell.
And place this code inside it:
$pyFile="C:\PythonWorkspace\ExcelPythonToSQL.py"
$SQLInstance="SomeMachineName\SomeNamedSQLInstance"
#or . or just the computername or localhost if your SQL instance is a default instance i.e. not a named one.
$dbName="SomeDatabase"
$ExcelFileFullPath="C:\Temp\ExampleExcelFile.xlsx"
$targetTableName="somenewtable"
C:\ProgramData\Miniconda3\envs\YOURVENVNAMEHERE\python $pyFile $SQLInstance $dbName $ExcelFileFullPath $targetTableName
Save the step and the job.
Now let's wrap it around something easier to handle. Because remember, this job and step is not like an SSIS step where you could potentially alter the parameter values in its configuration tab. You don't want to properties the job and the step each time and specify different excel file or target table.
So.
Ah also, do me a solid and do this little trick. Do a small alteration in the code, anything and then instead of OK do a Script to New Query Window. That way we can capture the guid of the job without having to query for it.
So now.
Create a SP like so:
use [YourDatabase];
GO
create proc [ExcelPythonToSQL_amend_job_step_params]( #pyFile nvarchar(max),
#SQLInstance nvarchar(max),
#dbName nvarchar(max),
#ExcelFileFullPath nvarchar(max),
#targetTableName nvarchar(max)='somenewtable'
)
as
begin
declare #sql nvarchar(max);
set #sql = '
exec msdb.dbo.sp_update_jobstep #job_id=N''7f6ff378-56cd-4a8d-ba40-e9057439a5bc'', #step_id=1,
#command=N''
$pyFile="'+#pyFile+'"
$SQLInstance="'+#SQLInstance+'"
$dbName="'+#dbName+'"
$ExcelFileFullPath="'+#ExcelFileFullPath+'"
$targetTableName="'+#targetTableName+'"
C:\ProgramData\Miniconda3\envs\YOURVENVGOESHERE\python $pyFile $SQLInstance $dbName $ExcelFileFullPath $targetTableName''
';
--print #sql;
exec sp_executesql #sql;
end
But inside it you must replace 2 things. One, the global uniqueidentifier for the Agent job that you found by doing the trick I described earlier, yes the one with the script to new query window. Two, you must fill in the name of your Python venv replacing the word YOURVENVGOESHERE in the code.
Cool.
Now, with a simple script we can play-test it.
Let's have in a new query window something like this:
use [YourDatabase];
GO
--to set parameters
exec [ExcelPythonToSQL_amend_job_step_params] #pyFile='C:\PythonWorkspace\ExcelPythonToSQL.py',
#SQLInstance='.',
#dbName='YourDatabase',
#ExcelFileFullPath='C:\Temp\ExampleExcelFile.xlsx',
#targetTableName='somenewtable';
--to execute the job
exec msdb.dbo.sp_start_job N'ExcelPythonToSQL', #step_name = N'PowerShell parametrized script';
--let's test that the table is there and drop it.
if object_id('YourDatabase..somenewtable') is not null
begin
select 'Table was here!' [test: table exists?];
drop table [somenewtable];
end
else select 'NADA!' [test: table exists?];
You can run the set parameters part, then the execution, carefull to then wait a little bit like a few seconds, calling the sp_start_job like in this script is asynchronous. And then run the test script to clean up and make sure it had gone in.
That's it.
Obviously lots of variations are possible.
Like in the job step, we could instead call a batch file, we could call a powershell .ps1 file and have the parameters in there, lots and lots of other ways of doing it. I merely described one in this post.
I have a simple SQL file that I'd like to read and execute using a Python Script Snap in SnapLogic. I created an expression library file to reference the Redshift account and have included it as a parameter in the pipeline.
I have the code below from another post. Is there a way to reference the pipeline parameter to connect to the Redshift database, read the uploaded SQL file and execute the commands?
fd = open('shared/PythonExecuteTest.sql', 'r')
sqlFile = fd.read()
fd.close()
sqlCommands = sqlFile.split(';')
for command in sqlCommands:
try:
c.execute(command)
except OperationalError, msg:
print "Command skipped: ", msg
You can access pipeline parameters in scripts using $_.
Let's say, you have a pipeline parameter executionId. Then to access it in the script you can do $_executionId.
Following is a test pipeline.
With the following pipeline parameter.
Following is the test data.
Following is the script
# Import the interface required by the Script snap.
from com.snaplogic.scripting.language import ScriptHook
import java.util
class TransformScript(ScriptHook):
def __init__(self, input, output, error, log):
self.input = input
self.output = output
self.error = error
self.log = log
# The "execute()" method is called once when the pipeline is started
# and allowed to process its inputs or just send data to its outputs.
def execute(self):
self.log.info("Executing Transform script")
while self.input.hasNext():
try:
# Read the next document, wrap it in a map and write out the wrapper
in_doc = self.input.next()
wrapper = java.util.HashMap()
wrapper['output'] = in_doc
wrapper['output']['executionId'] = $_executionId
self.output.write(in_doc, wrapper)
except Exception as e:
errWrapper = {
'errMsg' : str(e.args)
}
self.log.error("Error in python script")
self.error.write(errWrapper)
self.log.info("Finished executing the Transform script")
# The Script Snap will look for a ScriptHook object in the "hook"
# variable. The snap will then call the hook's "execute" method.
hook = TransformScript(input, output, error, log)
Output:
Here, you can see that the executionId was read from the pipeline parameters.
Note: Accessing pipeline parameters from scripts is a valid scenario but accessing other external systems from the script is complicated (because you would need to load the required libraries) and not recommended. Use the snaps provided by SnapLogic to access external systems. Also, if you want to use other libraries inside scripts, try sticking to Javascript instead of going to python because there are a lot of open source CDNs that you can use in your scripts.
Also, you can't access any configured expression library directly from the script. If you need some logic in the script, you would keep it in the script and not somewhere else. And, there is no point in accessing account names in the script (or mappers) because, even if you know the account name, you can't use the credentials/configurations stored in that account directly; that is handled by SnapLogic. Use the provided snaps and mappers as much as possible.
Update #1
You can't access the account directly. Accounts are managed and used internally by the snaps. You can only create and set accounts through the accounts tab of the relevant snap.
Avoid using script snap as much as possible; especially, if you can do the same thing using normal snaps.
Update #2
The simplest solution to this requirement would be as follows.
Read the file using a file reader
Split based on ;
Execute each SQL command using the Generic JDBC Execute Snap
I want to customize the way IPython saves history. IPython creates a sqlite3 database that contains the history of typed commands, but I want to customize it so that will allow me to save additional information, such as the current working directory when the line of code was executed. A table of this would like like:
Command Directory
import sys /Users/home
import os /Users/home
os.chdir("~/data") /Users/home
data = open("input_file", 'r').read() /Users/home/data
Also, I want to have this done as each line is executed (as opposed to saving as at the end of a session manually).
Does anyone know how to do this? IPython does not seem to have an configurable options that allow you to add custom information to history that I could find on their documentation, or looking at option available in config HistoryManager within IPython
After much playing around, I figured it out. The key is to modify the IPython code that runs the IPython interactive shell. Within the script interactiveshell.py (for me, this is located in ~/anaconda/lib/python3.6/site-packages/IPython/core), put the following block of code in the script interactiveshell.py, at the very beginning of the function run_cell (this is approximately lines 2617):
#location of command
cwd=os.getcwd()
file_1=open("~/.custom_history_file.txt", "a")
#time of command
showtime = strftime("%Y-%m-%d %H:%M:%S", gmtime())
showtime=showtime.rstrip()
#ignore those lines with just empty spaces
if raw_cell.strip():
print (raw_cell,file=file_1,sep="",end="")
print ("#Command Info:\t","cwd: ",cwd,"\tdate: ",showtime,file=file_1)
#print("\n",file=file_1)
#print (raw_cell,cwd,sep="\t",file=file_1)
file_1.close()
At the top of the script, put the following:
from time import gmtime, strftime
Now as you use IPython, your custom history file will be filled with the python commands and the information
I have a malformed database. When I try to get records from any of two tables, it throws an exception:
DatabaseError: database disk image is malformed
I know that through commandline I can do this:
sqlite3 ".dump" base.db | sqlite3 new.db
Can I do something like this from within Python?
As far as i know you cannot do that (alas, i might be mistaken), because the sqlite3 module for python is very limited.
Only workaround i can think of involves calling the os command shell (e.g. terminal, cmd, ...) (more info) via pythons call-command:
Combine it with the info from here to do something like this:
This is done on an windows xp machine:
Unfortunately i can't test it on a unix machine right now - hope it will help you:
from subprocess import check_call
def sqliterepair():
check_call(["sqlite3", "C:/sqlite-tools/base.db", ".mode insert", ".output C:/sqlite-tools/dump_all.sql", ".dump", ".exit"])
check_call(["sqlite3", "C:/sqlite-tools/new.db", ".read C:/sqlite-tools/dump_all.sql", ".exit"])
return
The first argument is calling the sqlite3.exe. Because it is in my system path variable, i don't need to specify the path or the suffix ".exe".
The other arguments are chained into the sqlite3-shell.
Note that the argument ".exit" is required so the sqlite-shell will exit. Otherwise the check_call() will never complete because the outer cmd-shell or terminal will be in suspended.
Of course the dump-file should be removed afterwards...
EDIT: Much shorter solution (credit goes to OP (see comment))
os.system("sqlite3 C:/sqlite-tools/base.db .dump | sqlite3 C:/sqlite-tools/target.db")
Just tested this: it works. Apparently i was wrong in the comments.
If I understood properly, what you want is to duplicate an sqlite3 database in python. Here is how I would do it:
# oldDB = path to the corrupted db,
# newDB = path to the new db
def duplicateDB(oldDB, newDB):
con = sqlite3.connect(oldDB)
script = ''.join(con.iterdump())
con.close()
con = sqlite3.connect(newDB)
con.executescript(script)
con.close()
print "duplicated %s into %s" % (oldDB,newDB)
In your example, call duplicateDB('base.db', 'new.db'). The iterdump function is equivalent to dump.
Note that if you use Python 3, you will need to change the print statement.
I'm working on a python script in Red Hat that connect to a SQL database and run queries on a specific row. What I am attempting to do is to set the row that I want to query as an input variable that get's passed into the script when i start it. For example, when I start the script I would like to just run the command:
./SQLQuery.py Row_To_Query = Row_Name
For a similar script that I have in VBScript I would run the wscript command like this:
wscript SQLQuery.vbs /name:Row_Name
Then the row name would get passed into the script and the relevant data used as the script progresses.
My end goal is to run the python script as a task every 10 minuets or so and pull the relevant data from a specific row, but also have the ability to specify which row the SQL queries are run against without having to edit the script each time I want to run it.
You can use argparse to parse the commandline arguments given in a standard unix-like format:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--row-to-query', type=str, help='row to query')
namespace = parser.parse_args()
row_name = namespace.row_to_query
...
You'd then run the program like:
./SQLQuery.py --row-to-query=some_row_name
or
./SQLQuery.py --row-to-query some_row_name
both invocations should be equivalent.