After reading the documentation on Output Caching based on a file target, I figured this workflow should be an example of output caching:
from time import sleep
from prefect import Flow, task
from prefect.engine.results import LocalResult

@task(target="func_task_target.txt", checkpoint=True,
      result=LocalResult(dir="~/.prefect"))
def func_task():
    sleep(5)
    return 99

with Flow("Test-cache") as flow:
    func_task()

if __name__ == '__main__':
    flow.run()
I would expect func_task to run one time, get cached, and then use the cached value next time I run the flow. However, it seems that func_task runs each time.
Where am I going wrong? Or have I misunderstood the documentation?
Try setting the environment variable PREFECT__FLOWS__CHECKPOINTING to "true":
import os
os.environ["PREFECT__FLOWS__CHECKPOINTING"] = "true"
You can also change the results directory:
os.environ["PREFECT__HOME_DIR"] = "path to dir"
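For the caching to kick in, these variables have to be in place before Prefect loads its configuration, which happens when prefect is first imported. A minimal sketch of the ordering (values as in the answer above):
import os

# Set the configuration variables first ...
os.environ["PREFECT__FLOWS__CHECKPOINTING"] = "true"
os.environ["PREFECT__HOME_DIR"] = "path to dir"  # optional: change the results dir

# ... and only then import prefect, so the flow picks them up.
from prefect import Flow, task
With checkpointing enabled, the first flow run writes the target file and later runs should load the cached value instead of re-running func_task.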
I am trying to simplify a workflow that requires several individual scripts to be run. So far, I have been able to write a script that runs the other scripts, but I have one issue that I can't seem to resolve. Each of the sub-scripts requires a file path, and one argument within the path needs to be changed depending on who runs the scripts. Currently, I have to open each sub-script and manually change this argument.
Is it possible to set this argument to a variable in the parent script, which can then be passed to the sub-scripts? That way it would only need to be set once and would no longer require updating in each sub-script.
So far I have:
import os

def driver(path: str):
    path_base = path
    path_use = os.path.join(path_base, 'docs', 'analysis', 'forecast')
    file_cash = os.path.join(path_use, 'cash.py')
    file_cap = os.path.join(path_use, 'cap.py')
    exec(open(file_cash).read())
    exec(open(file_cap).read())
    return

if __name__ == '__main__':
    driver(path=r'c:\users\[username]')
I would like to set path=r'c:\users\[username]' and then pass that to cash.py and cap.py.
Instead of trying to replicate the behaviour of the import statement, you should import these sub-scripts directly and pass the values you need them to use as function / method arguments. To import a script from a specific file path, you can use importlib.util.spec_from_file_location(), like this:
main.py
import importlib.util
import os

def load_module(name, file_path):
    # Load a module object from an explicit file path
    spec = importlib.util.spec_from_file_location(name, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

def driver(path: str):
    path_use = os.path.join(path, 'docs', 'analysis', 'forecast')
    cash = load_module('cash', os.path.join(path_use, 'cash.py'))
    cap = load_module('cap', os.path.join(path_use, 'cap.py'))
    cash.cash("some_arg")
    cap.cap("some_other_arg")

if __name__ == '__main__':
    driver(path=r'c:\users\[username]')
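For this to work, each sub-script has to expose its logic as a function rather than running it at import time. A hypothetical cash.py might look like this (the function body is illustrative):
# cash.py -- hypothetical sub-script exposing its logic as a function
def cash(some_arg):
    # the argument passed in from main.py replaces the value that previously
    # had to be edited by hand in each sub-script
    print('running cash forecast with', some_arg)

if __name__ == '__main__':
    cash('default_arg')  # still runnable on its own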
I want to use Luigi to manage workflows in OpenStack. I am new to Luigi. As a starter, I just want to authenticate myself to OpenStack and then fetch the image list, flavor list, etc. using Luigi. Any help would be appreciated.
I am not good with Python, but I tried the code below. I am also not able to list images. Error: glanceclient.exc.HTTPNotFound: The resource could not be found. (HTTP 404)
import luigi
import os_client_config
import glanceclient.v2.client as glclient
from luigi.mock import MockFile
import sys
import os

def get_credentials():
    d = {}
    d['username'] = 'X'
    d['password'] = 'X'
    d['auth_url'] = 'X'
    d['tenant_name'] = 'X'
    d['endpoint'] = 'X'
    return d

class LookupOpenstack(luigi.Task):
    d = []

    def requires(self):
        pass

    def output(self):
        gc = glclient.Client(**get_credentials())
        images = gc.images.list()
        print("images", images)
        for i in images:
            print(i)
        return MockFile("images", mirror_on_stderr=True)

    def run(self):
        pass

if __name__ == '__main__':
    luigi.run(["--local-scheduler"], LookupOpenstack())
The general approach to this is to just write Python code that performs the tasks you want using the OpenStack API: https://docs.openstack.org/user-guide/sdk.html. It looks like the error you are getting is addressed on the OpenStack site: https://ask.openstack.org/en/question/90071/glanceclientexchttpnotfound-the-resource-could-not-be-found-http-404/
You would then just wrap this code in Luigi tasks as appropriate; there's nothing special about doing this with OpenStack, except that you must define the output() of your Luigi tasks to match up with an output that indicates the task is done. Right now it looks like the work is being done in the output() method, which should be in the run() method. The output() method should just describe what to look for to indicate that run() is complete, so the task doesn't run() again when required by another task if it is already done.
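As a sketch of that run()/output() split (class name, target path, and the placeholder image list are illustrative, not taken from your code):
import luigi

class ListImages(luigi.Task):
    def output(self):
        # Only describe the artifact that marks this task as done; no API calls here.
        return luigi.LocalTarget('images.txt')

    def run(self):
        # Do the actual work here, e.g. fetch the image list via glanceclient,
        # then write it to the target so Luigi knows the task completed.
        images = ['image-1', 'image-2']  # placeholder for gc.images.list()
        with self.output().open('w') as f:
            for image in images:
                f.write(str(image) + '\n')

if __name__ == '__main__':
    luigi.build([ListImages()], local_scheduler=True)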
It's really impossible to say more without understanding more details of your workflow.
I have a python script that does some updates on my database.
The files that this script needs are saved in a directory at around 3AM by some other process.
So I'm going to schedule a cron job to run daily at 3 AM, but I want to handle the case where the file is not available exactly at 3 AM; it could be delayed by some interval.
So I basically need to keep checking whether the file of some particular name exists every 5 minutes starting from 3AM. I'll try for around 1 hour, and give up if it doesn't work out.
How can I achieve this sort of thing in Python?
Try something like this (you'll need to change the print statements to be function calls if you are using Python 3).
#!/usr/bin/env python
import os
import time

def watch_file(filename, time_limit=3600, check_interval=60):
    '''Return true if filename exists; if not, keep checking once every
    check_interval seconds for time_limit seconds.
    time_limit defaults to 1 hour
    check_interval defaults to 1 minute
    '''
    now = time.time()
    last_time = now + time_limit
    while time.time() <= last_time:
        if os.path.exists(filename):
            return True
        else:
            # Wait for check_interval seconds, then check again.
            time.sleep(check_interval)
    return False

if __name__ == '__main__':
    filename = '/the/file/Im/waiting/for.txt'
    time_limit = 3600    # one hour from now.
    check_interval = 60  # seconds between checking for the file.
    if watch_file(filename, time_limit, check_interval):
        print "File present!"
    else:
        print "File not found after waiting:", time_limit, "seconds!"
For this sort of task you can use watchdog, a library for listening to and monitoring system events.
One of the things it can monitor is file system events, via the FileSystemEventHandler class, which has an on_created() method.
You'll end up writing a "wrapper" script that can run continuously. This script will use watchdog to listen on that particular directory. The moment a file is created, the script will be notified; you then check whether the created file matches the pattern of the target file and execute your custom code.
Luckily, as this is a common task, there is a PatternMatchingEventHandler already available, which inherits from FileSystemEventHandler but only watches for files matching a pattern.
Your wrapper script then becomes:
import time

from watchdog.observers import Observer
from watchdog.events import PatternMatchingEventHandler

class FileWatcher(PatternMatchingEventHandler):
    patterns = ["*.dat"]  # adjust as required

    def process(self, event):
        # your actual code goes here
        # event.src_path will be the full file path
        # event.event_type will be 'created', 'moved', etc.
        print('{} observed on {}'.format(event.event_type, event.src_path))

    def on_created(self, event):
        self.process(event)

if __name__ == '__main__':
    obs = Observer()  # This is what manages running of your code
    obs.schedule(FileWatcher(), path='/the/target/dir')
    obs.start()  # Start watching
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        obs.stop()
    obs.join()
This is what comes to mind first, pretty straightforward:
from time import sleep

counter = 0
working = True
while counter < 11 and working:
    try:
        # Open file and do whatever you need
        working = False
    except IOError:
        counter += 1
        sleep(5 * 60)
A better solution:
import os
from time import sleep

counter = 0
working = True
while counter < 11 and working:
    if os.path.isfile('path/to/your/file'):
        # Open file and do whatever you need
        working = False
    else:
        counter += 1
        sleep(5 * 60)
In Python you can check whether the file exists:
import os.path
os.path.isfile(filename)
Then you set your cron to run every 5 minutes from 3am:
*/5 3 * * * /path-to-your/script.py
You can write to a simple marker file (or a database, if you are already using one) to track whether you have already processed the data file.
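Putting those pieces together, a minimal sketch of the cron-driven script might look like this (the paths and the marker-file scheme are illustrative):
import datetime
import os

DATA_FILE = '/path/to/data/file.csv'
# One marker file per day records that the data has already been processed.
MARKER_FILE = '/path/to/processed_%s.marker' % datetime.date.today().isoformat()

if os.path.isfile(DATA_FILE) and not os.path.isfile(MARKER_FILE):
    # ... run the database update here ...
    open(MARKER_FILE, 'w').close()  # remember that today's file was handled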
You can use Twisted and its reactor, which is much better than an infinite loop. You can use reactor.callLater(myTime, myFunction), and when myFunction gets called you can adjust myTime and schedule another callback with the same callLater() API.
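A minimal sketch of that callLater() approach (the file path, interval, and retry limit are illustrative):
import os
from twisted.internet import reactor

FILENAME = '/path/to/expected/file.txt'
CHECK_INTERVAL = 5 * 60  # seconds between checks
MAX_CHECKS = 12          # give up after roughly an hour

def check_file(attempt=0):
    if os.path.isfile(FILENAME):
        # ... process the file here ...
        reactor.stop()
    elif attempt >= MAX_CHECKS:
        # Give up; the file never showed up in the allotted window.
        reactor.stop()
    else:
        # Re-schedule ourselves; the interval could also be adjusted here.
        reactor.callLater(CHECK_INTERVAL, check_file, attempt + 1)

reactor.callLater(0, check_file)
reactor.run()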
Let's say I create this simple module and call it MyModule.py:
import threading
import multiprocessing
import time

def workerThreaded():
    print 'thread working...'
    time.sleep(2)
    print 'thread complete'

def workerProcessed():
    print 'process working...'
    time.sleep(2)
    print 'process complete'

def main():
    workerThread = threading.Thread(target=workerThreaded)
    workerThread.start()
    workerProcess = multiprocessing.Process(target=workerProcessed)
    workerProcess.start()
    workerThread.join()
    workerProcess.join()

if __name__ == '__main__':
    main()
And then I throw this together to unit test it:
import unittest
import MyModule

class MyModuleTester(unittest.TestCase):
    def testMyModule(self):
        MyModule.main()

unittest.main()
(I know this isn't a good unit test because it doesn't actually TEST it, it just runs it, but that's not relevant to my question)
If I run this unit test in PyCharm with code coverage, then it only shows the code inside the workerThreaded() and main() functions as being covered, even though it clearly covers the workerProcessed() function as well.
How do I get PyCharm to include code that was started in a new process in its code coverage? Also, how can I get it to include the if __name__ == '__main__': block as well?
I'm running PyCharm 2.7.3, as well as Python 2.7.3.
Coverage.py can measure code run in subprocesses; details are at http://nedbatchelder.com/code/coverage/subprocess.html
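A minimal sketch of that setup, following the coverage.py subprocess docs: put parallel = True under [run] in your .coveragerc, point the COVERAGE_PROCESS_START environment variable at that file before the subprocesses start, and make sure every process imports a sitecustomize.py along these lines (file locations are illustrative):
# sitecustomize.py -- must be importable by every Python process
import coverage

# This is a no-op unless COVERAGE_PROCESS_START is set in the environment,
# so it is harmless during normal (non-coverage) runs.
coverage.process_startup()
After the test run, coverage combine merges the per-process data files into a single report.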
I managed to make it work with subprocesses; I'm not sure whether this will work with threads or with Python 2.
Create a .coveragerc file in your project root:
[run]
concurrency=multiprocessing
Create a sitecustomize.py file in your project root:
import atexit
from glob import glob
import os
from functools import partial
from shutil import copyfile
from tempfile import mktemp

def combine_coverage(coverage_pattern, xml_pattern, old_coverage, old_xml):
    from coverage.cmdline import main

    # Find newly created coverage files
    coverage_files = [file for file in glob(coverage_pattern) if file not in old_coverage]
    xml_files = [file for file in glob(xml_pattern) if file not in old_xml]
    if not coverage_files:
        raise Exception("No coverage files generated!")
    if not xml_files:
        raise Exception("No coverage xml file generated!")

    # Combine all coverage files
    main(["combine", *coverage_files])
    # Convert them to xml
    main(["xml"])

    # Copy combined xml file over PyCharm generated one
    copyfile('coverage.xml', xml_files[0])
    os.remove('coverage.xml')

def enable_coverage():
    import coverage

    # Enable subprocess monitoring by providing rc file and enable coverage collecting
    os.environ['COVERAGE_PROCESS_START'] = os.path.join(os.path.dirname(__file__), '.coveragerc')
    coverage.process_startup()

    # Get current coverage files so we can process only newly created ones
    temp_root = os.path.dirname(mktemp())
    coverage_pattern = '%s/pycharm-coverage*.coverage*' % temp_root
    xml_pattern = '%s/pycharm-coverage*.xml' % temp_root
    old_coverage = glob(coverage_pattern)
    old_xml = glob(xml_pattern)

    # Register atexit handler to collect coverage files when python is shutting down
    atexit.register(partial(combine_coverage, coverage_pattern, xml_pattern, old_coverage, old_xml))

if os.getenv('PYCHARM_RUN_COVERAGE'):
    enable_coverage()
This basically detects whether the code is running under PyCharm Coverage and collects the newly generated coverage files. There are multiple files, one for the main process and one for each subprocess, so we need to combine them with "coverage combine", convert them to xml with "coverage xml", and copy the resulting file over PyCharm's generated xml file.
Note that if you kill the child process in your tests, coverage.py will not write the data file.
It does not require anything else; just hit the "Run unittests with Coverage" button in PyCharm.
That's it.
I have a python script that sets up several gearman workers. They call into some methods on SQLAlchemy models I have that are also used by a Pylons app.
Everything works fine for an hour or two, then the MySQL thread gets lost and all queries fail. I cannot figure out why the thread is getting lost (I get the same results on 3 different servers) when I am defining such a low value for pool_recycle. Also, why wouldn't a new connection be created?
Any ideas of things to investigate?
import gearman
import json
import ConfigParser
import sys
from sqlalchemy import create_engine

class JSONDataEncoder(gearman.DataEncoder):
    @classmethod
    def encode(cls, encodable_object):
        return json.dumps(encodable_object)

    @classmethod
    def decode(cls, decodable_string):
        return json.loads(decodable_string)

# get the ini path and load the gearman server ips:ports
try:
    ini_file = sys.argv[1]
    lib_path = sys.argv[2]
except Exception:
    raise Exception("ini file path or anypy lib path not set")

# get the config
config = ConfigParser.ConfigParser()
config.read(ini_file)
sqlachemy_url = config.get('app:main', 'sqlalchemy.url')
gearman_servers = config.get('app:main', 'gearman.mysql_servers').split(",")

# add anypy include path
sys.path.append(lib_path)
from mypylonsapp.model.user import User, init_model
from mypylonsapp.model.gearman import task_rates

# sqlalchemy setup, recycle connection every hour
engine = create_engine(sqlachemy_url, pool_recycle=3600)
init_model(engine)

# Gearman Worker Setup
gm_worker = gearman.GearmanWorker(gearman_servers)
gm_worker.data_encoder = JSONDataEncoder()

# register the workers
gm_worker.register_task('login', User.login_gearman_worker)
gm_worker.register_task('rates', task_rates)

# work
gm_worker.work()
I've seen this across the board for Ruby, PHP, and Python regardless of the DB library used. I couldn't find how to fix this the "right" way, which is to use mysql_ping, but there is a SQLAlchemy solution, as explained better here: http://groups.google.com/group/sqlalchemy/browse_thread/thread/9412808e695168ea/c31f5c967c135be0
As someone in that thread points out, setting the recycle option equal to True is equivalent to setting it to 1. A better solution might be to find your MySQL connection timeout value and set the recycle threshold to 80% of it.
You can get that value from a live server by looking up this variable: http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_connect_timeout
Edit:
It took me a bit to find the authoritative documentation on using pool_recycle:
http://www.sqlalchemy.org/docs/05/reference/sqlalchemy/connections.html?highlight=pool_recycle
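As a sketch of the 80% suggestion above, assuming a reasonably recent SQLAlchemy (the variable lookup follows the link above; sqlachemy_url is the URL read from the config in the question):
from sqlalchemy import create_engine, text

# Connect once to read the server-side timeout ...
probe = create_engine(sqlachemy_url)
with probe.connect() as conn:
    row = conn.execute(text("SHOW VARIABLES LIKE 'connect_timeout'")).fetchone()
    server_timeout = int(row[1])

# ... then build the real engine, recycling pooled connections at ~80% of it.
engine = create_engine(sqlachemy_url, pool_recycle=int(server_timeout * 0.8))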