How to debug a SQL/Python UDF in MonetDB

How to debug a SQL/Python UDF in MonetDB - python

Using native Python code in SQL UDFs in Monetdb is really powerful. BUT, debugging such UDFs could benefit from more support. In particular, if I use the old-fashioned print('debugging info') it disappears in the big black void.
create function dummy()
returns string
language python{
print('Entering the dummy UDF')
return 'hello';
};
How to retrieve this information from the server or MonetDB client.

I was debugging some Python UDF last week :)
Step 1: first make sure your Python code at least works in a Python interpreter.
Step 2: in a Python UDF, write your debugging info. to a file, e.g.:
f = open('/tmp/debug.out', 'w')
f.write('my debugging info\n')
f.close()
This isn't ideal, but it works. Also, I used this to export the parameter values of my Python UDF. In this way, I can run the body of my Python UDF in a Python interpreter with the exact data I receive from MonetDB.

In case someone is still interested in this problem.
There are two novel ways of debugging MonetDB's Python/UDFs.
1) Using the python client pymonetdb (https://github.com/gijzelaerr/pymonetdb).
You can install it throw pip
pip install numpy
To use it, think of the following setting with a table that holds an integer and a UDF that computes the mean absolute deviation of a given column.
CREATE TABLE integers(i INTEGER);
INSERT INTO integers VALUES (1), (3), (6), (8), (10);
CREATE OR REPLACE FUNCTION mean_deviation(column INTEGER)
RETURNS DOUBLE LANGUAGE PYTHON {
mean = 0.0
for i in range (0, len(column)):
mean += column[I]
mean = mean / len(column)
distance = 0.0
for i in range (0, len(column)):
distance += column[i] - mean
deviation = distance/len(column)
return deviation;
};
To debug your function using terminal debugging (i.e., pdb) you just need to open a database connection using pymonetdb.connect(), later you get a cursor object from the connection, and through the cursor object you call the debug() function, sending as parameters the SQL you want to examine and the UDF name you wish to debug.
import pymonetdb
conn = pymonetdb.connect(database='demo') #Open Database connection
c = conn.cursor()
sql = 'select mean_deviation(i) from integers;'
c.debug(sql, 'mean_deviation') #Console Debugging
There is an optional sampling step that only transfers a uniform random sample of the data instead of the full input data set. If you wish to sample you just need to send the number of elements you wish to get from the sampling (e.g., c.debug(sql, 'mean_deviation', 10) in case you desire the subset of 10 elements)
2) Using a POC plugin for PyCharm called devudf, which you can install throw the plugin page of pycharm, or by directly going to the JetBrains page: https://plugins.jetbrains.com/plugin/12063-devudf. It adds an option to the main menu called "UDF Development" and allows for you do directly import and export UDFs from your database directly to pycharm, and enjoy the IDE's debugging capabilities.

Related

Select different dataset when testing | Separate test from production

This question is partly about how to test external dependencies (a.k.a. integration tests) and partly how to implement it with Python for SQL with BigQuery in specific. So answers only about 'This is how you should do integration tests' are very welcome.
In my project I have two different datasets
'project_1.production.table_1'
'project_1.development.table_1'
When running my tests I would like to call the development environment. But how to separate it properly from my production code as I don't want to clutter my production code with test(set-up) code.
Production code looks like:
def find_data(variable_x: string) -> DataFrame:
query = '''
SELECT *
FROM `project_1.production.table_1`
WHERE foo = #variable_x
'''
job_config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ScalarQueryParameter(
name='foo', type_="STRING", value=variable_x
)
]
)
df = self.client.query(
query=query, job_config=job_config).to_dataframe()
return df
Solution 1 : Environment variables for the dataset
The python-dotenv module can be used to differentiate production from development, as I do for some parts of my code. The problem is that bigQuery does not allow to parameterize the dataset. (To prevent SQL-injection I think) See running parameterized queries docs
From the docs
Parameters cannot be used as substitutes for identifiers, column
names, table names, or other parts of the query.
So having the environment variable as dataset name is not possible.
Solution 2 : Environment variable for flow control
I could add a if production == True evaluation and select the dataset. However this results in test/debug code in my production code. I would like to avoid it as much as possible.
from os import getenv
def find_data(variable_x : string) -> Dataframe:
load_dotenv()
PRODUCTION = getenv("PRODUCTION")
if PRODUCTION == TRUE:
*Execute query on project_1.production.table_1*
else:
*Execute query on project_1.development.table_1*
job_config = (*snip*)
df = (*snip*)
return df
Solution 3 : Mimic function in testcode
Make a copy of the production code and set up the test code so that the development dataset is called.
This leads to duplication of code (one in production code and one in test code). A result of this duplication will lead to a mismatch of the code may the implementation of the function change over time. So I think this solution is not 'Embracing Change'
Solution 4 : Skip testing this function
Perhaps this function does not need to be called at all in my test code. Just take a snippet of the result of this query and use the result as a 'data injection' into the tests that depend on this result. However then I need to adjust my architecture a bit.
The above solutions don't satisfy me completely. I wonder if there is another way to solve this issue or if one of the above solutions is acceptable?

It looks like string formatting (sometimes referred to as string interpolation) might be enough to get you where you want. You could replace the first part of your function by the following code:
query = '''
SELECT *
FROM `{table}`
WHERE foo = #variable_x
'''.format(table = getenv("DATA_TABLE"))
This works because the query is just a string and you can do whatever you want with it before you pass it on the the BigQuery library. The String.format allows us to replace values inside a string, which is exactly what we need (see this article for a more in depth explanation about String.format)
Important security note: it is in general a bad security practice to manipulate SQL queries as plain strings (as we are doing here), but since you control the environment variables of the application it should be safe in this particular case.

How to change username of job in print queue using python & win32print

I am trying to change the user of a print job in the queue, as I want to create it on a service account but send the job to another users follow-me printing queue. I'm using the win32 module in python. Here is an example of my code:
from win32 import win32print
JOB_INFO_LEVEL = 2
pclExample = open("sample.pcl")
printer_name = win32print.GetDefaultPrinter()
hPrinter = win32print.OpenPrinter(printer_name)
try:
jobID = win32print.StartDocPrinter(hPrinter, 1, ("PCL Data test", None, "RAW"))
# Here we try to change the user by extracting the job and then setting it again
jobInfoDict = win32print.GetJob(hPrinter, jobID , JOB_INFO_LEVEL )
jobInfoDict["pUserName"] = "exampleUser"
win32print.SetJob(hPrinter, jobID , JOB_INFO_LEVEL , jobInfoDict , win32print.JOB_CONTROL_RESUME )
try:
win32print.StartPagePrinter(hPrinter)
win32print.WritePrinter(hPrinter, pclExample)
win32print.EndPagePrinter(hPrinter)
finally:
win32print.EndDocPrinter(hPrinter)
finally:
win32print.ClosePrinter(hPrinter)
The problem is I get an error at the win32print.SetJob() line. If JOB_INFO_LEVEL is set to 1, then I get the following error:
(1804, 'SetJob', 'The specified datatype is invalid.')
This is a known bug to do with how the C++ works in the background (Issue here).
If JOB_INFO_LEVEL is set to 2, then I get the following error:
(1798, 'SetJob', 'The print processor is unknown.')
However, this is the processor that came from win32print.GetJob(). Without trying to change the user this prints fine, so I'm not sure what is wrong.
Any help would be hugely appreciated! :)
EDIT:
Using Python 3.8.5 and Pywin32 303

At the beginning I thought it was a misunderstanding (I was also a bit skeptical about the bug report), mainly because of the following paragraph (which apparently seems to be wrong) from [MS.Docs]: SetJob function (emphasis is mine):
The following members of a JOB_INFO_1, JOB_INFO_2, or JOB_INFO_4 structure are ignored on a call to SetJob: JobId, pPrinterName, pMachineName, pUserName, pDrivername, Size, Submitted, Time, and TotalPages.
But I did some tests and ran into the problem. The problem is as described in the bug: filling JOB_INFO_* string members (which are LPTSTRs) with char* data.
Submitted [GitHub]: mhammond/pywin32 - Fix: win32print.SetJob sending ANSI to UNICODE API (and none of the 2 errors pops up). It was merged to main on 220331.
When testing the fix, I was able to change various properties of an existing job, I was amazed that it didn't have to be valid data (like below), I'm a bit curious to see what would happen when the job would be executed (as now I don't have a connection to a printer):
Change pUserName to str(random.randint(0, 10000)) to make sure it changes on each script run (PrintScreens taken separately and assembled in Paint):
Ways to go further:
Wait for a new PyWin32 version (containing this fix) to be released. This is the recommended approach, but it will also take more time (and it's unclear when it will happen)
Get the sources, either:
from main
from b303 (last stable branch), and apply the (above) patch(1)
build the module (.pyd) and copy it in the PyWin32's site-packages directory on your Python installation(s). Faster, but it requires some deeper knowledge, and maintenance might become a nightmare
Footnotes
#1: Check [SO]: Run / Debug a Django application's UnitTests from the mouse right click context menu in PyCharm Community Edition? (#CristiFati's answer) (Patching UTRunner section) for how to apply patches (on Win).

Converting Intersystems cache objectscript into a python function

I am accessing an Intersystems cache 2017.1.xx instance through a python process to get various attributes about the database in able to monitor the database.
One of the items I want to monitor is license usage. I wrote a objectscript script in a Terminal window to access license usage by user:
s Rset=##class(%ResultSet).%New("%SYSTEM.License.UserListAll")
s r=Rset.Execute()
s ncol=Rset.GetColumnCount()
While (Rset.Next()) {f i=1:1:ncol w !,Rset.GetData(i)}
But, I have been unable to determine how to convert this script into a Python equivalent. I am using the intersys.pythonbind3 import for connecting and accessing the cache instance. I have been able to create python functions that accessing most everything else in the instance but this one piece of data I can not figure out how to translate it to Python (3.7).

Following should work (based on the documentation):
query = intersys.pythonbind.query(database)
query.prepare_class("%SYSTEM.License","UserListAll")
query.execute();
# Fetch each row in the result set, and print the
# name and value of each column in a row:
while 1:
cols = query.fetch([None])
if len(cols) == 0: break
print str(cols[0])
Also, notice that InterSystems IRIS -- successor to the Caché now has Python as an embedded language. See more in the docs

Since the noted query "UserListAll" is not defined correctly in the library; not SqlProc. So to resolve this issue would require a ObjectScript with the query and the use of #Result set or similar in Python to get the results. So I am marking this as resolved.

Not sure which Python interface you're using for Cache/IRIS, but this Open Source 3rd party one is worth investigating for the kind of things you're trying to do:
https://github.com/chrisemunt/mg_python

Sequence nextval/currval in two sessions

Setup:
Oracle DB running on a windows machine
Mac connected with the database, both in the same network
Problem:
When I created a sequence in SQL Developer, I can see and use the sequence in this session. If I logoff and login again the sequence is still there. But if I try to use the sequence via Python and cx_Oracle, it doesn't work. It also doesn't work the other way around.
[In SQL Developer: user: uc]
create SEQUENCE seq1;
select seq1.nextval from dual; ---> 1
commit; --> although the create statement is a DDL method, just in case
[login via Python, user: uc]
select seq1.currval from dual;--> ORA-08002 Sequence seq1.currval isn't defined in this session
The python code:
import cx_Oracle
cx_Oracle.init_oracle_client(lib_dir="/Users/benreisinger/Documents/testclients/instantclient_19_8", config_dir=None, error_url=None, driver_name=None)
# Connect as user "hr" with password "hr" to the "orclpdb" service running on a remote computer.
connection = cx_Oracle.connect("uc", "uc", "10.0.0.22/orcl")
cursor = connection.cursor()
cursor.execute("""
select seq1.currval from dual
""")
print(cursor)
for seq1 in cursor:
print(seq1)
The error says, that [seq1] wasn't defined in this session, but why does the following work:
select seq1.nextval from dual
--> returns 2
Even after issuing this, I can't use seq1.currval
Btw., select sequence_name from user_sequences returns seq1in Python
[as SYS user]
select * from v$session
where username = 'uc';
--> returns zero rows
Why is seq1 not in reach for the python program ?
Note: With tables, everything just works fine
EDIT:
also with 'UC' being upper case, no rows returned
first issuing
still doesn't work

Not sure how to explain this. The previous 2 answers are correct, but somehow you seem to miss the point.
First, take everything that is irrelevant out of the equation. Mac client on Windows db: doesn't matter. SQLDeveloper vs python: doesn't matter. The only thing that matters is that you connect twice to the database as the same schema. You connect twice, that means that you have 2 separate sessions and those sessions don't know about each other. Both sessions have access to the same database objects, so you if you execute ddl (eg create sequence), that object will be visible in the other session.
Now to the core of your question. The oracle documentation states
"To use or refer to the current sequence value of your session, reference seq_name.CURRVAL. CURRVAL can only be used if seq_name.NEXTVAL has been referenced in the current user session (in the current or a previous transaction)."
You have 2 different sessions, so according to the documentation, you should not be able to call seq_name.CURRVAL in the other session. That is exactly the behaviour you are seeing.
You ask "Why is seq1 not in reach for the python program ?". The answer is: you're not correct, it is in reach for the python program. You can call seq1.NEXTVAL from any session. But you cannot invoke seq1.NEXTVAL from one session (SQLDeveloper) and then invoke seq1.CURRVAL from another session (python) because that is just how sequences works as stated in documentation.
Just to confirm you're not in the same session, execute the following statement for both clients (SQLDeveloper and python):
select sys_context('USERENV','SID') from dual;
You'll notice that the session id is different.

CURRVAL returns the last allocated sequence number in the current session. So it only works when we have previously executed a NEXTVAL. So these two statements will return the same value when run in the same session:
select seq1.nextval from dual
/
select seq1.currval from dual
/
It's not entirely clear what you're trying to achieve, but it looks like your python code is executing a single statement for the connection, so it's not tapping into an existing session.
This statement returns zero rows ...
select * from v$session
where username = 'uc';
... because database objects in Oracle are stored in UPPER case (at least by default, but it's wise to stick with that default. So use where username = 'UC' instead.

Python established a new session. In it, sequence hasn't been invoked yet, so its currval doesn't exist. First you have to select nextval (which, as you said, returned 2) - only then currval will make sense.
Saying that
Even after issuing this, I can't use seq1.currval
is hard to believe.
This: select * From v$session where username = 'uc' returned nothing because - by default - all objects are stored in uppercase, so you should have ran
.... where username = 'UC'
Finally:
commit; --> although the create statement is a DDL method, just in case
Which case? There's no case. DDL commits. Moreover, commits twice (before and after the actual DDL statement). And there's nothing to commit either. Therefore, what you did is unnecessary and pretty much useless.

How to assign a 2d libreoffice calc named range to a python variable. Can do it in Libreoffice Basic

I can't seem to find a simple answer to the question. I have this successfully working in Libreoffice Basic:
NamedRange = ThisComponent.NamedRanges.getByName("transactions_detail")
RefCells = NamedRange.getReferredCells()
Set MainRange = RefCells.getDataArray()
Then I iterate over MainRange and pull out the rows I am interested in.
Can I do something similar in a python macro? Can I assign a 2d named range to a python variable or do I have to iterate over the range to assign the individual cells?
I am new to python but hope to convert my iteration intensive macro function to python in hopes of making it faster.
Any help would be much appreciated.
Thanks.

LibreOffice can be manipulated from Python with the library pyuno. The documentation of pyuno is unfortunately incomplete but going through this tutorial may help.
To get started:
Python-Uno, the library to communicate via Uno, is already in the LibreOffice Python’s path. To initialize your context, type the following lines in your python shell :
import socket # only needed on win32-OOo3.0.0
import uno
# get the uno component context from the PyUNO runtime
localContext = uno.getComponentContext()
# create the UnoUrlResolver
resolver = localContext.ServiceManager.createInstanceWithContext(
"com.sun.star.bridge.UnoUrlResolver", localContext )
# connect to the running office
ctx = resolver.resolve( "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext" )
smgr = ctx.ServiceManager
# get the central desktop object
desktop = smgr.createInstanceWithContext( "com.sun.star.frame.Desktop",ctx)
# access the current writer document
model = desktop.getCurrentComponent()
Then to get a named range and access the data as an array, you can use the following methods:
NamedRange = model.NamedRanges.getByName(“Test Name”)
MainRange = NamedRange.getDataArray()
However I am unsure that this will result in a noticeable preformance gain.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.