I am trying to run an ExecuteScript processor in Apache NiFi using Python, but I am having a problem passing the flow file to the next processor in my data flow.
If I run the standalone flow-file creation and writing snippet, it works and I can read the flow file in the next processor. But when I try to enrich it, it simply does not pass the flow file on. No error is generated, and I have no clue how to proceed. I am a bit new to Python and NiFi and would appreciate your help with this particular issue.
Below is the code I am using; as you can see, it's very simple. I just want to create a flow file and write a string to it using some logic, but no luck so far.
import urllib2
import json
import datetime
import csv
import time
import sys
import traceback
from org.apache.nifi.processor.io import OutputStreamCallback
from org.python.core.util import StringUtil

class WriteContentCallback(OutputStreamCallback):
    def __init__(self, content):
        self.content_text = content

    def process(self, outputStream):
        try:
            outputStream.write(StringUtil.toBytes(self.content_text))
        except:
            traceback.print_exc(file=sys.stdout)
            raise

page_id = "dsssssss"
access_token = "sdfsdfsf%sdfsdf"

def scrapeFacebookPageFeedStatus(page_id, access_token):
    flowFile = session.create()
    flowFile = session.write(flowFile, WriteContentCallback("Hello there this is my data"))
    flowFile = session.write()
    session.transfer(flowFile, REL_SUCCESS)
    print "\nDone!\n%s Statuses Processed in %s" % \
        (num_processed, datetime.datetime.now() - scrape_starttime)

if __name__ == '__main__':
    scrapeFacebookPageFeedStatus(page_id, access_token)
I believe the problem is the check for __main__:
if __name__ == '__main__':
    scrapeFacebookPageFeedStatus(page_id, access_token)
In my experiment, the actual module name was __builtin__, so that guard never fires inside ExecuteScript. You could either remove that check, or add a different one if you want to preserve your separate testing path.
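For example, a minimal fix (only a sketch) is to drop the guard so the function is called unconditionally:

# run unconditionally; inside ExecuteScript the module name is not '__main__'
scrapeFacebookPageFeedStatus(page_id, access_token)

Alternatively, a guard such as if __name__ in ('__main__', '__builtin__'): would keep your standalone testing path while still firing inside ExecuteScript; the '__builtin__' value is what I observed in my experiment, so verify it in your environment.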
I have two Python files, Cron.py and Workflow.py. Workflow.py processes newly created files, and Cron.py calls Workflow.py every 5 seconds using a scheduler.
When I execute Cron.py, the code works fine while there are files to process. But as soon as there are no files left to process, Cron.py throws an AttributeError:
ERROR:root:'Cron' object has no attribute
Traceback (most recent call last):
  File "C:\Workflow.py", line 371, in Start
    self.setupLoggingToFile()
AttributeError: 'Cron' object has no attribute 'setupLoggingToFile'
Below is my Cron code:
import schedule
from Workflow import Workflow as w
import time

class Cron:
    def start_job(self):
        print('************Cron Job Cycle Started**************')
        w.Start(self)
        print('************Cron Job Cycle Ended **************')

    def Start(self):
        scheduler = schedule()
        scheduler.every(5).seconds.do(self.start_job())
        while 1:
            scheduler.run_pending()
            time.sleep(1)

A = Cron()
A.start_job()
and Workflow.py code:
import os
import subprocess
import pyodbc
import time
from multiprocessing.dummy import Pool as ThreadPool
from lxml import etree
import os.path
import datetime
from os import listdir
from os.path import isfile, join
import logging
import logging.handlers
from logging.handlers import RotatingFileHandler
import json
import pandas

class Workflow:
    def setupLoggingToFile(self):
        logging.basicConfig(
            # filemode='a',
            format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
            datefmt='%m-%d-%y %H:%M:%S',
            level=logging.DEBUG,
            handlers=[RotatingFileHandler('C:/ExceptionLogFiles/MyLogs', maxBytes=10485760, backupCount=100)]
        )

    def Start(self):
        try:
            print('Cron started..')
            # self.createFolders()
            self.setupLoggingToFile()
            print('Files Folder Setup completed ..')
            # Get number of files which are not processed.
            files = self.GetRequestedFilesCount()
            if files[0] > 0:
                print(str(files[0]) + " files to be processed..")
                # Get filing ids and status of files which are to be processed
                resultset = self.GetRequestedFileInfo()
                filingId = []
                for fileid, status in resultset:
                    filingId.append(str(fileid) + "##" + str(status))
                # Create threads based on the number of filing ids to be processed.
                pool = ThreadPool(len(filingId))
                results = pool.map(self.ProcessFile, filingId)  # Process the filingIds in parallel.
                pool.close()
                pool.join()
            else:
                print("No Files to be Processed.")
        except AttributeError:
            logging.exception("'Cron' object has no attribute ", exc_info=True)
        except Exception:
            logging.exception("ProcessFile Function: Filing ID: {} ".format(filingId), exc_info=True)

A = Workflow()
A.Start()
Any idea how to tell Cron.py to stop gracefully, without throwing an exception?
It looks like the problem is in Cron.start_job, with the line
w.Start(self)
The self variable inside Cron is the Cron instance.
You then pass it to the Start() method of the Workflow class (imported as w).
Workflow.Start() then calls self.setupLoggingToFile(), but self is now the Cron instance you passed in, not the Workflow instance it should be.
Therefore the error message says exactly what is going on: 'Cron' object has no attribute 'setupLoggingToFile'.
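A stripped-down illustration of the same binding mistake (hypothetical classes, Python 3 semantics):

class Cat:
    def speak(self):
        self.meow()  # looks up 'meow' on whatever object is bound to self

    def meow(self):
        print('meow')

class Dog:
    pass

Cat.speak(Dog())  # AttributeError: 'Dog' object has no attribute 'meow'

This is exactly the shape of w.Start(self): the method exists, but it is handed an object of the wrong class.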
A possible solution is to create an instance of your Workflow class and pass it to the start_job method of Cron, for example:
import schedule
from Workflow import Workflow

class Cron:
    def start_job(self, workflow_instance):
        workflow_instance.Start()

    def Start(self, workflow_instance):
        scheduler = schedule.Scheduler()
        # pass the callable and its argument to do(); calling start_job here
        # would schedule its return value instead of the job itself
        scheduler.every(5).seconds.do(self.start_job, workflow_instance)
        # ...
        # NOTE: large parts of original code left out for clarity

w = Workflow()
c = Cron()
c.start_job(w)
The question remains whether the code, with these changes, does what you expect it to do. Depending on what exactly the Workflow class does, you may get some problems if a job does not finish within 5 seconds.
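If that turns out to be a problem, one common workaround (a sketch of my own, not part of the fix above) is to start each job in its own thread so a slow run does not hold up the scheduler loop:

import threading

def run_threaded(job_func, *args):
    # the scheduler tick only pays the cost of starting a thread
    t = threading.Thread(target=job_func, args=args)
    t.daemon = True
    t.start()

# inside Cron.Start, schedule the wrapper instead of the job itself:
# scheduler.every(5).seconds.do(run_threaded, self.start_job, workflow_instance)

Note that overlapping runs then become possible, so Workflow.Start would need to be safe to run concurrently.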
I am trying to generate multiple flowfiles from one flowfile using an ExecuteScript processor in Python.
The output flowfiles depend on one attribute (for configuration) and on the input flowfile (XML content).
I have tried many things, but I always end up with errors like:
this flowfile is already marked for transfer
transfer relationship not specified
Below is the latest version:
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
import java.io
from org.python.core.util import StringUtil

class PyStreamCallback(StreamCallback):
    def __init__(self, flowFile):
        global matched
        self.parentFlowFile = flowFile
        pass

    def process(self, inputStream, outputStream):
        try:
            text_content = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
            flowfiles_list = []
            new_xml = "blabla"
            outputStream.write(bytearray(new_xml.encode('utf-8')))
            for n in range(0, 5):
                flowFile = session.create(self.parentFlowFile)
                if (flowFile != None):
                    flowFile = session.write(flowFile, "Nothing")
                    flowfiles_list.append(flowFile)
            for flow in flowfiles_list:
                session.transfer(flow, REL_SUCCESS)
        except:
            print('Error inside process')
            raise

originalFlowFile = session.get()
if (originalFlowFile != None):
    try:
        originalFlowFile = session.write(originalFlowFile, PyStreamCallback(originalFlowFile))
        session.remove(originalFlowFile)
    except Exception as e:
        originalFlowFile = session.putAttribute(originalFlowFile, 'python_error', str(e))
        session.transfer(originalFlowFile, REL_FAILURE)
Can someone tell me what I am doing wrong and how to achieve what I want to do?
Here are some notes on your script:
1) You are subclassing StreamCallback and writing to the original flow file, but then you remove it later. StreamCallback is for when you want to overwrite the contents of the existing flow file. If you don't need to do that, you can use InputStreamCallback as the base class; it won't take an outputStream arg, but you wouldn't need one in that case. You'd also use session.read on the original flow file rather than session.write.
2) The line flowFile = session.write(flowFile, "Nothing") isn't valid because session.write needs an OutputStreamCallback or StreamCallback as the argument (same as where you call it with PyStreamCallback below). When that throws an error, it gets raised all the way to the top level of the script, but by then you've created a flow file and never reached the statement that transfers the flowfiles_list to REL_SUCCESS. Consider adding a try/except around the session.write, so you can remove the newly created flow file before re-raising the exception.
3) If you want to read the entire content of the incoming flow file into memory (which you are currently doing), then remove the original flow file and create new flow files from it, consider instead using the version of session.read() that returns an InputStream (i.e. doesn't require an InputStreamCallback). Then you can save the contents into a global variable and/or pass them into an OutputStreamCallback when you want to write something to the created flow files. Something like:
inputStream = session.read(originalFlowFile)
text_content = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
inputStream.close()

flowfiles_list = []
for n in range(0, 5):
    flowFile = session.create(originalFlowFile)
    if (flowFile != None):
        try:
            flowFile = session.write(flowFile, PyStreamCallback(text_content))
            flowfiles_list.append(flowFile)
        except Exception as e:
            session.remove(flowFile)
            raise
for flow in flowfiles_list:
    session.transfer(flow, REL_SUCCESS)
session.remove(originalFlowFile)
This doesn't include the refactor of PyStreamCallback to be an OutputStreamCallback that takes a string arg instead of a FlowFile in the constructor.
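That refactor might look something like the following sketch (same Jython imports as in the script above):

from org.apache.nifi.processor.io import OutputStreamCallback

class PyStreamCallback(OutputStreamCallback):
    def __init__(self, content):
        # hold the string to write instead of a reference to the parent flow file
        self.content = content

    def process(self, outputStream):
        outputStream.write(bytearray(self.content.encode('utf-8')))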
I am reading some data from gpsd and writing it using Python's logging module. I am fairly confident that only one process writes to this file, although I read from it using tail while this logger is running. Every once in a while I am seeing some log entries that look like the image below. I am hoping that somebody can shed some light on what would cause (presumably Python) to insert null control characters into my log file.
The code I am using is:
"""
Read the GPS continuously
"""
import logging
from logging.handlers import RotatingFileHandler
import sys
import gps

LOG = '/sensor_logs/COUNT.csv'

def main():
    log_formatter = logging.Formatter('%(asctime)s,%(message)s', "%Y-%m-%d %H:%M:%S")
    my_handler = RotatingFileHandler(LOG, mode='a', maxBytes=1024*1024,
                                     backupCount=1, encoding=None, delay=0)
    my_handler.setFormatter(log_formatter)
    my_handler.setLevel(logging.INFO)
    app_log = logging.getLogger('root')
    app_log.setLevel(logging.INFO)
    app_log.addHandler(my_handler)

    session = gps.gps("localhost", "2947")
    session.stream(gps.WATCH_ENABLE | gps.WATCH_NEWSTYLE)

    while True:
        try:
            report = session.next()
            if hasattr(report, 'gdop'):
                satcount = 0
                for s in report['satellites']:
                    if s['used'] == True:
                        satcount += 1
                data = "{}".format(str(satcount))
                app_log.info(data)
        except KeyError:
            pass
        except KeyboardInterrupt:
            quit()
        except StopIteration:
            session = None

if __name__ == "__main__":
    main()
This ended up being a result of failing hardware. Thank you for your help, @RobertB.
I have read many articles over the last 6 hours and I still don't understand mocking and unit testing. I want to unit test a function that opens a file; how can I do this correctly?
I am also concerned because the bulk of my code uses external files for data import and manipulation. I understand that I need to mock them for testing, but I am struggling to understand how to move forward.
Some advice please. Thank you in advance.
prototype5.py
import os
import sys
import io
import pandas

pandas.set_option('display.width', None)

def openSetupConfig(a):
    """
    SUMMARY
    Read setup file
    setup file will ONLY hold the file path of the working directory
    :param a: str
    :return: contents of the file stored as str
    """
    try:
        setupConfig = open(a, "r")
        return setupConfig.read()
    except Exception as ve:
        ve = (str(ve) + "\n\nPlease ensure setup file " + str(a) + " is available")
        sys.exit(ve)

dirPath = openSetupConfig("Setup.dat")
test_prototype5.py
import prototype5
import unittest

class TEST_openSetupConfig(unittest.TestCase):
    """
    Test the openSetupConfig function from the prototype 5 library
    """
    def test_open_correct_file(self):
        result = prototype5.openSetupConfig("Setup.dat")
        self.assertTrue(result)

if __name__ == '__main__':
    unittest.main()
So the rule of thumb is to mock, stub or fake all external dependencies of the method/function under test. The point is to test the logic in isolation. So in your case you want to test that it returns the file contents when it can open the file, and that it exits with an error message when it can't.
import unittest
from mock import patch, mock_open
from prototype5 import openSetupConfig  # you don't want to run the whole file
import __builtin__  # needed to mock open

class TestOpenSetupConfig(unittest.TestCase):

    def test_openSetupConfig_with_valid_file(self):
        """
        It should return file contents when passed a valid file.
        """
        expect = 'fake_contents'
        # mock_open builds a mock file object whose read() returns read_data
        with patch('__builtin__.open', mock_open(read_data=expect)) as m_open:
            actual = openSetupConfig("Setup.dat")
        self.assertEqual(expect, actual)
        m_open.assert_called()

    @patch('prototype5.sys.exit')
    def test_openSetupConfig_with_invalid_file(self, mock_exit):
        """
        It should exit with an error message when passed an invalid file.
        """
        # IOError rather than FileNotFoundError, since this is Python 2
        with patch('__builtin__.open', side_effect=IOError):
            openSetupConfig('foo')
        mock_exit.assert_called()
I'm just wondering about the behaviour of Python and how it really works. I have a script that runs and collects all followers and friends of an account.
This is the code.
#!/usr/bin/env python
import pymongo
import tweepy
from pymongo import MongoClient
from sweepy.get_config import get_config

config = get_config()

consumer_key = config.get('PROCESS_TWITTER_CONSUMER_KEY')
consumer_secret = config.get('PROCESS_TWITTER_CONSUMER_SECRET')
access_token = config.get('PROCESS_TWITTER_ACCESS_TOKEN')
access_token_secret = config.get('PROCESS_TWITTER_ACCESS_TOKEN_SECRET')

MONGO_URL = config.get('MONGO_URL')
MONGO_PORT = config.get('MONGO_PORT')
MONGO_USERNAME = config.get('MONGO_USERNAME')
MONGO_PASSWORD = config.get('MONGO_PASSWORD')

client = MongoClient(MONGO_URL, int(MONGO_PORT))

print 'Establishing Tweepy connection'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, retry_count=3)

db = client.tweets
db.authenticate(MONGO_USERNAME, MONGO_PASSWORD)
raw_tweets = db.raw_tweets
users = db.users

def is_user_in_db(screen_name):
    return get_user_from_db(screen_name) is None

def get_user_from_db(screen_name):
    return users.find_one({'screen_name': screen_name})

def get_user_from_twitter(user_id):
    return api.get_user(user_id)

def get_followers(screen_name):
    users = []
    for i, page in enumerate(tweepy.Cursor(api.followers, id=screen_name, count=200).pages()):
        print 'Getting page {} for followers'.format(i)
        users += page
    return users

def get_friends(screen_name):
    users = []
    for i, page in enumerate(tweepy.Cursor(api.friends, id=screen_name, count=200).pages()):
        print 'Getting page {} for friends'.format(i)
        users += page
    return users

def get_followers_ids(screen_name):
    ids = []
    for i, page in enumerate(tweepy.Cursor(api.followers_ids, id=screen_name, count=5000).pages()):
        print 'Getting page {} for followers ids'.format(i)
        ids += page
    return ids

def get_friends_ids(screen_name):
    ids = []
    for i, page in enumerate(tweepy.Cursor(api.friends_ids, id=screen_name, count=5000).pages()):
        print 'Getting page {} for friends ids'.format(i)
        ids += page
    return ids

def process_user(user):
    screen_name = user['screen_name']
    print 'Processing user : {}'.format(screen_name)
    if is_user_in_db(screen_name):
        user['followers_ids'] = get_followers_ids(screen_name)
        user['friends_ids'] = get_friends_ids(screen_name)
        users.insert_one(user)
    else:
        print '{} exists!'.format(screen_name)
    print 'End processing user : {}'.format(screen_name)

if __name__ == "__main__":
    for doc in raw_tweets.find({'processed': {'$exists': False}}):
        print 'Start processing'
        try:
            process_user(doc['user'])
        except KeyError:
            pass
        try:
            process_user(doc['retweeted_status']['user'])
        except KeyError:
            pass
        raw_tweets.update_one({'_id': doc['_id']}, {'$set': {'processed': True}})
What I keep getting from the log is
Rate limit reached. Sleeping for: 889
Establishing Tweepy connection
Start processing
Processing user : littleaddy80
Establishing Tweepy connection
Start processing
Processing user : littleaddy80
Establishing Tweepy connection
Start processing
Processing user : littleaddy80
Establishing Tweepy connection
Start processing
Processing user : littleaddy80
Rate limit reached. Sleeping for: 891
This puzzles me because 'Establishing Tweepy connection' is outside of __main__ and shouldn't be running over and over again. Is this just how Python behaves, or is there a bug in my code?
When you run or import a Python script, every statement in it is executed (when imported, however, this only happens the first time the module is imported, or when you do reload(module)). A few normally present kinds of statements are worth noting:
The execution of a function definition means that the function is being defined (it does not execute the body of the function).
The execution of an import statement will import the module.
The execution of a class definition implies that the body of it is executed; since it mostly contains function definitions, this mostly means defining methods.
The execution of an if statement means that the controlling expression is first evaluated, and depending on the result the body may be executed.
The execution of an assignment means that the right-hand-side expression will be evaluated, with possible side effects.
This is why one normally doesn't put code directly at the top level of a Python script - it will be executed on import. If the file should work as both a script and a module, the code that should run only when it is run as a script should be enclosed in an if __name__ == '__main__' block.
Unless you need global variables, your script would be a bunch of function and class definitions followed by:
if __name__ == "__main__":
    code_to_be_executed_iff_run_as_a_script()
else:
    code_to_be_executed_iff_imported()
If you need global variables, you will sometimes have to take special care to avoid side effects when running/importing the module.
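For instance, one way to avoid such side effects (a sketch of my own, with hypothetical names) is to build expensive objects lazily inside a function instead of at module level:

# lazy_api.py -- hypothetical module, just to illustrate the pattern
_api = None

def get_api():
    # the connection is created on the first call, not when the module is imported
    global _api
    if _api is None:
        print('Establishing Tweepy connection')
        _api = object()  # stand-in for the real tweepy.API(auth, ...) setup
    return _api

Importing lazy_api then prints nothing; the message appears only the first time get_api() is called.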
If you want code that runs only when imported, it would go in the else clause of the normal __main__ guard:
if __name__ == '__main__':
    print("Run as a script")
else:
    print("Imported as a module")
That's exactly the reason why there's
if __name__ == "__main__":
Before this condition you should have function and class definitions, and after it the code you would like to run.
The reason for this is that the __name__ variable is different when your file is imported (as every Python file is also an importable module) and when it is run directly, e.g. python myfile.py.
Create a file, e.g. myfile.py:
# content of myfile.py
print(__name__)
When you run it, it will print __main__.
$ python myfile.py
__main__
But during import it carries the name of the imported module (myfile).
$ python
>>> import myfile
myfile