Transfer multiple flowFiles from ExecuteScript in NiFi - Python

I am trying to generate multiple flowfiles from one flowfile using an ExecuteScript processor in Python.
The output flowfiles depend on one attribute (for configuration) and on the input flowfile's XML content.
I tried many things, but I always end up with errors like:
this flowfile is already marked for transfer
transfer relationship not specified
Below is the latest version:
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
import java.io
from org.python.core.util import StringUtil

class PyStreamCallback(StreamCallback):
    def __init__(self, flowFile):
        global matched
        self.parentFlowFile = flowFile
        pass

    def process(self, inputStream, outputStream):
        try:
            text_content = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
            flowfiles_list = []
            new_xml = "blabla"
            outputStream.write(bytearray(new_xml.encode('utf-8')))
            for n in range(0, 5):
                flowFile = session.create(self.parentFlowFile)
                if (flowFile != None):
                    flowFile = session.write(flowFile, "Nothing")
                    flowfiles_list.append(flowFile)
            for flow in flowfiles_list:
                session.transfer(flow, REL_SUCCESS)
        except:
            print('Error inside process')
            raise

originalFlowFile = session.get()
if (originalFlowFile != None):
    try:
        originalFlowFile = session.write(originalFlowFile, PyStreamCallback(originalFlowFile))
        session.remove(originalFlowFile)
    except Exception as e:
        originalFlowFile = session.putAttribute(originalFlowFile, 'python_error', str(e))
        session.transfer(originalFlowFile, REL_FAILURE)
Can someone tell me what I am doing wrong and how to achieve what I want to do?

Here are some notes on your script:
1) You are subclassing StreamCallback and writing to the original flow file, but then you remove it later. StreamCallback is for when you want to overwrite the contents of the existing flow file. If you don't need to do that, you can use InputStreamCallback as the base class; it won't take an outputStream arg, but you wouldn't need one in that case. You'd also use session.read on the original flow file rather than session.write.
2) The line flowFile = session.write(flowFile, "Nothing") isn't valid, because session.write needs an OutputStreamCallback or StreamCallback as the argument (the same as where you call it with PyStreamCallback below). When that throws an error, it gets raised all the way to the top level of the script, but by then you've created a flow file and never reached the statement that transfers flowfiles_list to REL_SUCCESS. Consider adding a try/except around the session.write; then you could remove the newly created flow file and re-raise the exception.
3) If you want to read the entire content of the incoming flow file into memory (which you are currently doing), then remove the original flow file and instead create new flow files from it, consider using the version of session.read() that returns an InputStream (i.e. the one that doesn't require an InputStreamCallback). Then you can save the contents into a global variable and/or pass them into an OutputStreamCallback when you want to write something to the created flow files. Something like:
inputStream = session.read(originalFlowFile)
text_content = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
inputStream.close()
flowfiles_list = []
for n in range(0, 5):
    flowFile = session.create(originalFlowFile)
    if (flowFile != None):
        try:
            flowFile = session.write(flowFile, PyStreamCallback(text_content))
            flowfiles_list.append(flowFile)
        except Exception as e:
            session.remove(flowFile)
            raise
for flow in flowfiles_list:
    session.transfer(flow, REL_SUCCESS)
session.remove(originalFlowFile)
This doesn't include the refactor of PyStreamCallback to be an OutputStreamCallback that takes a string arg instead of a FlowFile in the constructor.
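A minimal sketch of that refactor, assuming the standard Jython OutputStreamCallback pattern (names are illustrative):

from org.apache.nifi.processor.io import OutputStreamCallback

class PyStreamCallback(OutputStreamCallback):
    def __init__(self, content):
        # Keep the string captured from the original flow file
        self.content = content

    def process(self, outputStream):
        # Write the stored string into the created flow file
        outputStream.write(bytearray(self.content.encode('utf-8')))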


Python log file is not created unless basicConfig is called at the top before any functions

I have a script that processes CSVs and loads them into a database. My intern mentor wanted us to use a log file to capture what's going on, and he wanted it to be flexible, so one can use a config.ini file to edit where the log file should be created. As a result I did just that, using a config file whose key-value pairs go into a dict from which I can extract the path to the log file. These are excerpts from my code where the log file is created and used:
dirconfig_file = r"C:\Users\sys_nsgprobeingestio\Documents\dozie\odfs\venv\odfs_tester_history_dirs.ini"
start_time = datetime.now()

def process_dirconfig_file(config_file_from_sysarg):
    try:
        if Path.is_file(dirconfig_file_Pobj):
            parseddict = {}
            configsects_set = set()
            for sect in config.sections():
                configsects_set.add(sect)
                for k, v in config.items(sect):
                    # print('{} = {}'.format(k, v))
                    parseddict[k] = v
            print(parseddict)
            try:
                if ("log_dir" not in parseddict or parseddict["log_dir"] == "" or "log_dir" not in configsects_set):
                    raise Exception(f"Error: Your config file is missing 'logfile path' or properly formatted [log_file] section for this script to run. Please edit config file to include logfile path to capture errors")
            except Exception as e:
                # raise Exception(e)
                logging.exception(e)
                print(e)

parse_dict = process_dirconfig_file(dirconfig_file)
logfilepath = parse_dict["log_dir"]
log_file_name = start_time.strftime(logfilepath)
print(log_file_name)

logging.basicConfig(
    filename=log_file_name,
    level=logging.DEBUG,
    format='[Probe Data Quality] %(asctime)s - %(name)s %(levelname)-7.7s %(message)s'
    # can you explain this Tenzin?
)

if __name__ == '__main__':
    try:
        startTime = datetime.now()
        db_instance = dbhandler(parse_dict["db_string"])
        odfs_tabletest_dict = db_instance['odfs_tester_history_files']
        odf_history_from_csv_to_dbtable(db_instance)
        # print("test exception")
        print(datetime.now() - startTime)
    except Exception as e:
        logging.exception(e)
        print(e)
Doing this, no file is created. The script runs with no errors, but no log file is created. I've tried several things, including using a hardcoded log file name instead of reading it from the config file, but it didn't work.
The only thing that works is when the log file is created at the top, before any method. Why is this?
When you are calling your process_dirconfig_file function, the logging configuration has not been set yet, so no file could have been created. The script executes top to bottom. It would be similar to doing something like this:
import sys
# default logging points to stdout/stderr kind of like this
my_logger = sys.stdout
my_logger.write("Something")
# Then you've pointed logging to a file
my_logger = open("some_file.log", 'w')
my_logger.write("Something else")
Only "Something else" would be written to our some_file.log, because my_logger pointed somewhere else beforehand.
Much the same is happening here. By default, the logging.<debug/info> functions do nothing, because logging won't do anything with them without additional configuration. logging.error, logging.warning, and logging.exception will always at least write to stderr out of the box.
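A quick way to see this default behaviour (Python 3):

import logging

# No basicConfig has been called yet:
logging.debug("dropped: below the default WARNING level, and no handler is configured")
logging.warning("printed: WARNING and above reach the last-resort handler (stderr)")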
Also, I don't think the inner try is valid Python; you need a matching except. And I wouldn't just print an exception raised by that function; I'd probably raise and have the program crash:
def process_dirconfig_file(config_file_from_sysarg):
    try:
        # Don't use logging.<anything> yet
        ~snip~
    except Exception as e:
        # Just raise, or don't use try/except at all until
        # you have a better idea of what you want to do in this circumstance
        raise
Especially since you are trying to use the logger while validating that its configuration is correct.
The fix? Don't use the logger until after you've determined it's ready.
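A minimal sketch of the fixed ordering, reusing the question's names:

import logging
from datetime import datetime

# 1. Resolve the log file path first, without touching the logging module
parse_dict = process_dirconfig_file(dirconfig_file)
log_file_name = datetime.now().strftime(parse_dict["log_dir"])

# 2. Configure logging before anything logs
logging.basicConfig(filename=log_file_name, level=logging.DEBUG)

# 3. Only now is it safe to use the logger
logging.info("Logging is configured; this line lands in the file.")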

How to ensure xlwings connection is closed if script fails

I am trying to develop something with xlwings, because I need to manipulate an xls file with macros etc. Although it is always good to close connections, Excel is notorious for blocking access if more than one instance is running. Therefore I need to make sure the app closes even if my code fails somewhere upstream.
I am currently doing this with a try statement that spans the whole script and calls app.quit() when it fails. But this suppresses my error messages, which makes debugging hard. So I feel there must be something better.
In another context I have seen with being used. I have the feeling it would apply here too, but I don't understand how it works, nor how it would work in this specific case.
import xlwings as xw

def myexcel():
    try:
        # connect to Excel app in the background
        excel = xw.App(visible=False)
        # open excel book
        wb = excel.books.open(str(file))
        # assign the active app so it can be closed later
        app = xw.apps.active
        # more code goes here
    except:
        app.quit()
How could one make sure that the Excel connection always gets closed, no matter what, in the most efficient way?
If with is the solution, I would also appreciate a pointer to a good source to learn more about that concept.
As you mentioned, you can use a with statement and build your own contextmanager. Here's a converted example based on your code:
import xlwings as xw

class MyExcelApp:
    def __init__(self):
        self.excel = xw.App(visible=False)

    def __enter__(self):
        return self.excel

    def __exit__(self, exc, value, traceback):
        # Handle your app-specific exceptions (exc) here
        self.excel.quit()
        return True
        # ^ return True only if you intend to catch all errors in here.
        # Otherwise, leave as is and use try... except on the outside.

class MyExcelWorkbook:
    def __init__(self, xlapp, bookname):
        self.workbook = xlapp.books.open(bookname)

    def __enter__(self):
        return self.workbook

    def __exit__(self, exc, value, traceback):
        # Handle your workbook-specific exceptions (exc) here
        # self.workbook.save()  # depends what you want to do here
        self.workbook.close()
        return True
        # ^ return True only if you intend to catch all errors in here.
        # Otherwise, leave as is and use try... except on the outside.
With this set up, you can simply call it like this:
with MyExcelApp() as app:
    with MyExcelWorkbook(app, filename) as wb:
        # do something with wb
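Both context managers can also be combined into a single with statement (the managers are entered left to right, so the app is available to the workbook manager):

with MyExcelApp() as app, MyExcelWorkbook(app, filename) as wb:
    # do something with wb
    pass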
You can also implement it with a generator, which will be quite similar to the other answer.
Here's a simplified version:
import xlwings as xw
from contextlib import contextmanager

@contextmanager
def my_excel_app():
    app = xw.App(visible=False)
    try:
        yield app
    except:  # <-- Add SPECIFIC app exceptions
        # Handle the errors, then re-raise so the caller still sees them
        raise
    finally:
        app.quit()
Usage:
with my_excel_app() as app:
    wb = app.books.open(some_file)
    # do something...
You are doing it right: using a try block is the way to go in this case. The with statement is good when you need to open a file, but not for your use case, where the library opens the Excel file in its own way.
To show the details of the exception, you can change your code as follows:
import xlwings as xw

def myexcel():
    app = None
    try:
        # connect to Excel app in the background
        excel = xw.App(visible=False)
        # open excel book
        wb = excel.books.open(str(file))
        # assign the active app so it can be closed later
        app = xw.apps.active
        # more code goes here
    except Exception as e:
        print('exception caught: {}'.format(e))
    finally:
        # quit on success and on failure alike
        if app is not None:
            app.quit()
Preferred solution
xlwings added a solution to this problem in v0.24.3:
xlwings.App() can now be used as a context manager, making sure that there are no zombie processes left over on Windows, even if you use a hidden instance and your code fails. It is therefore recommended to use it whenever you can, like so:
import xlwings as xw

with xw.App(visible=False) as app:
    wb = xw.Book("test.xlsx")
    sheet = wb.sheets['sheet1']
    # To evoke an error, I try to call a nonexistent sheet here.
    nonexistent_sheet["A1"]
Solution before v0.24.3
You can use the traceback library, which makes debugging easier because the error is displayed in red. See this example:
import xlwings as xw
import traceback

filename = "test.xlsx"
try:
    # Do what you want here in the try block. For example, the following lines.
    app = xw.App(visible=False)
    wb = xw.Book(filename)
    sheet = wb.sheets['sheet1']
    # To evoke an error, I try to call a nonexistent sheet here.
    nonexistent_sheet["A1"]
# Use BaseException because it catches all possible exceptions: https://stackoverflow.com/a/31609619/13968392
except BaseException:
    # print_exc() already prints the actual error in a verbose way,
    # so there is no need to wrap it in print().
    traceback.print_exc()
    app.quit()
traceback.print_exc() then prints the full stack trace of the error.

ExecuteScript in Python not passing flow file to next processor

I am trying to run an ExecuteScript processor in Apache NiFi using Python, but I'm having a problem passing the flow file to the next processor in my data flow.
If I run the standalone flow-file creation and writing snippet, it works and I can read the flow file in the next processor; but when I try to enrich it, it simply does not pass the flow file. In fact no error is generated, and I have no clue how to proceed. I am a bit new to Python and NiFi and would appreciate your help with this particular issue.
Below is the code I am using; as you can see, it's very simple. I just want to create a flow file and write some string to it using some logic. But no luck so far.
import urllib2
import json
import datetime
import csv
import time
import sys
import traceback
from org.apache.nifi.processor.io import OutputStreamCallback
from org.python.core.util import StringUtil

class WriteContentCallback(OutputStreamCallback):
    def __init__(self, content):
        self.content_text = content

    def process(self, outputStream):
        try:
            outputStream.write(StringUtil.toBytes(self.content_text))
        except:
            traceback.print_exc(file=sys.stdout)
            raise

page_id = "dsssssss"
access_token = "sdfsdfsf%sdfsdf"

def scrapeFacebookPageFeedStatus(page_id, access_token):
    flowFile = session.create()
    flowFile = session.write(flowFile, WriteContentCallback("Hello there this is my data"))
    flowFile = session.write()
    session.transfer(flowFile, REL_SUCCESS)
    print "\nDone!\n%s Statuses Processed in %s" % \
        (num_processed, datetime.datetime.now() - scrape_starttime)

if __name__ == '__main__':
    scrapeFacebookPageFeedStatus(page_id, access_token)
I believe the problem is the check for __main__:
if __name__ == '__main__':
    scrapeFacebookPageFeedStatus(page_id, access_token)
__builtin__ was the actual module name in my experiment. You could either remove that check, or add a different one if you want to preserve your separate testing path.
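For example, a sketch that preserves a separate testing path (the module name below is the one observed in the experiment above and may vary):

# Inside ExecuteScript the module name is not '__main__'
# ('__builtin__' in the experiment above), so accept both:
if __name__ in ('__main__', '__builtin__'):
    scrapeFacebookPageFeedStatus(page_id, access_token)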

Read from a log file as it's being written using python

I'm trying to find a nice way to read a log file in real time using Python. I'd like to process lines from a log file one at a time as it is written. Somehow I need to keep trying to read the file until it is created, and then continue to process lines until I terminate the process. Is there an appropriate way to do this? Thanks.
Take a look at this PDF, starting at page 38 (~slide I-77), and you'll find all the info you need. Of course the rest of the slides are amazing too, but those specifically deal with your issue:
import time

def follow(thefile):
    thefile.seek(0, 2)  # Go to the end of the file
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)  # Sleep briefly
            continue
        yield line
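A minimal usage sketch of that generator (the file name is hypothetical):

logfile = open("access.log")  # assumes the file already exists
for line in follow(logfile):
    print line,  # each yielded line already ends with a newline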
You could try with something like this:
import time

while 1:
    where = file.tell()
    line = file.readline()
    if not line:
        time.sleep(1)
        file.seek(where)
    else:
        print line,  # already has newline
Example was extracted from here.
As this is tagged with Python and logging, there is another possibility to do this.
I assume this is based on a Python logger, i.e. logging.Handler based.
You can just create a class that gets the (named) logger instance and overrides the emit function to put each record onto a GUI (if you also need the console, just add a console handler alongside the file handler).
Example:
import logging

class log_viewer(logging.Handler):
    """ Class to redistribute python logging data """

    # have a class member to store the existing logger
    logger_instance = logging.getLogger("SomeNameOfYourExistingLogger")

    def __init__(self, *args, **kwargs):
        # Initialize the Handler
        logging.Handler.__init__(self, *args)

        # optionally take a format
        # setFormatter function is derived from logging.Handler
        for key, value in kwargs.items():
            if key == "format":
                self.setFormatter(value)

        # make the logger send data to this class
        self.logger_instance.addHandler(self)

    def emit(self, record):
        """ Overload of logging.Handler method """
        record = self.format(record)
        # ---------------------------------------
        # Now you can send it to a GUI or similar
        # "Do work" starts here.
        # ---------------------------------------
        # just as an example what e.g. a console
        # handler would do:
        print(record)
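A quick usage sketch (the logger name is the one assumed in the class above):

import logging

# Instantiating the handler wires it up via addHandler in __init__
viewer = log_viewer(format=logging.Formatter("%(levelname)s %(message)s"))

log = logging.getLogger("SomeNameOfYourExistingLogger")
log.warning("this record is routed through log_viewer.emit")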
I am currently using similar code to add a TkinterTreectrl.Multilistbox for viewing logger output at runtime.
Side note: the logger only gets data once it has been initialized, so if you want all your data available, you need to initialize it at the very beginning. (I know this is what is expected, but I think it is worth mentioning.)
Maybe you could do a system call to tail -f using os.system().
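Note that os.system() blocks and only streams to the terminal; if you want the lines back in Python, a subprocess pipe is the closer fit. A minimal sketch (the file name is hypothetical):

import subprocess

# Spawn tail and read its stdout line by line;
# "-F" (rather than "-f") also retries until the file exists.
proc = subprocess.Popen(["tail", "-F", "some.log"], stdout=subprocess.PIPE)
for line in iter(proc.stdout.readline, b""):
    print(line.decode().rstrip())  # process each line here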

How do you subclass the file type in Python?

I'm trying to subclass the built-in file class in Python to add some extra features to stdin and stdout. Here's the code I have so far:
import datetime
import sys

class TeeWithTimestamp(file):
    """
    Class used to tee the output of a stream (such as stdout or stderr) into
    another stream, and to add a timestamp to each message printed.
    """

    def __init__(self, file1, file2):
        """Initializes the TeeWithTimestamp"""
        self.file1 = file1
        self.file2 = file2
        self.at_start_of_line = True

    def write(self, text):
        """Writes text to both files, prefixed with a timestamp"""
        if len(text):
            # Add timestamp if at the start of a line; also add [STDERR]
            # for stderr
            if self.at_start_of_line:
                now = datetime.datetime.now()
                prefix = now.strftime('[%H:%M:%S] ')
                if self.file1 == sys.__stderr__:
                    prefix += '[STDERR] '
                text = prefix + text
            self.file1.write(text)
            self.file2.write(text)
            self.at_start_of_line = (text[-1] == '\n')
The purpose is to add a timestamp to the beginning of each message, and to log everything to a log file. However, the problem I run into is that if I do this:
# log_file has already been opened
sys.stdout = TeeWithTimestamp(sys.stdout, log_file)
Then when I try to do print 'foo', I get a ValueError: I/O operation on closed file. I can't meaningfully call file.__init__() in my __init__(), since I don't want to open a new file, and I can't assign self.closed = False either, since it's a read-only attribute.
How can I modify this so that I can do print 'foo', and so that it supports all of the standard file attributes and methods?
Calling file.__init__ is quite feasible (e.g., on '/dev/null'), but it's of no real use, because your attempted override of write doesn't "take" for the purposes of print statements: print internally calls the real file.write when it sees that sys.stdout is an actual instance of file (and by inheriting you've made it so).
print doesn't really need any method other than write, so making your class inherit from object instead of file will work.
If you need other file methods (i.e., print is not all you're doing), you're best advised to implement them yourself.
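A minimal sketch of that object-based variant (same Python 2 print-statement setup as the question); the __getattr__ delegation is one optional way to cover the "other file methods" without writing each one by hand:

import datetime

class TeeWithTimestamp(object):  # inherit from object, not file
    def __init__(self, file1, file2):
        self.file1 = file1
        self.file2 = file2
        self.at_start_of_line = True

    def write(self, text):
        """Same timestamping logic as before; print only ever calls this."""
        if len(text):
            if self.at_start_of_line:
                text = datetime.datetime.now().strftime('[%H:%M:%S] ') + text
            self.file1.write(text)
            self.file2.write(text)
            self.at_start_of_line = (text[-1] == '\n')

    def __getattr__(self, name):
        # Delegate flush(), fileno(), etc. to the primary stream
        return getattr(self.file1, name)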
You can also avoid using super:
class SuperFile(file):
    def __init__(self, *args, **kwargs):
        file.__init__(self, *args, **kwargs)
You'll be able to write with it.
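For example (Python 2, since the built-in file type only exists there; the file name is hypothetical):

f = SuperFile("output.txt", "w")  # opened by file.__init__
f.write("hello\n")
f.close()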
