I am using Luigi to run several tasks, and then I need to bulk transfer the output to a standardized file location. I've written a WrapperTask with an overridden complete() method to do this:
import luigi
from luigi.task import flatten

class TaskX(luigi.WrapperTask):
    date = luigi.DateParameter()
    client = luigi.s3.S3Client()

    def requires(self):
        yield TaskA(date=self.date)
        yield TaskB(date=self.date)

    def complete(self):
        tasks_complete = all(r.complete() for r in flatten(self.requires()))
        # at the end of everything, batch copy the files
        if tasks_complete:
            self.client.copy('current-old', 'current')
            return True
        else:
            return False

if __name__ == "__main__":
    luigi.run()
but I'm having trouble getting the conditional part of complete() to be called when the process actually finishes.
I assume this is because of the asynchronous behavior others have pointed out, but I'm not sure how to fix it.
I've tried running Luigi with these command-line parameters:
$ PYTHONPATH="" luigi --module x TaskX --worker-retry-external-task
But that doesn't seem to be working correctly. Is this the right approach to handle this type of task?
Also, has anyone had experience with the --worker-retry-external-task flag? I'm having some trouble understanding it.
In the source code,
def _is_external(task):
    return task.run is None or task.run == NotImplemented
is called to determine whether or not a Luigi task has a run() method, which a WrapperTask does not. Thus, I'd expect the --worker-retry-external-task flag to retry complete() for this task until it succeeds, thus performing the action. However, just playing around in the interpreter leads me to believe that:
>>> import luigi_newsletter_process
>>> task = luigi_newsletter_process.Newsletter()
>>> task.run
<bound method Newsletter.run of Newsletter(date=2016-06-22, use_s3=True)>
>>> task.run()
>>> task.run == None
False
>>> task.run() == None
True
This code snippet is not doing what it thinks it is.
Am I off-base here?
I still think that overriding .complete() should in theory be able to do this, and I'm still not sure why it doesn't, but if you're just looking for a way to bulk-transfer files after running a process, a workable solution is to have the transfer take place within a .run() method:
def run(self):
    logger.info('transferring into current directory')
    self.client.copy('current-old', 'current')
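For what it's worth, a fuller sketch of how that might hang together, reusing the task and path names from the question (the task name itself is made up); note that a luigi.Task with no output() is never considered complete, so it will re-run on every invocation unless you give it a marker output:

import logging

import luigi
from luigi.s3 import S3Client

logger = logging.getLogger('luigi-interface')

class BulkTransferTask(luigi.Task):  # hypothetical name for the final step
    date = luigi.DateParameter()
    client = S3Client()

    def requires(self):
        yield TaskA(date=self.date)
        yield TaskB(date=self.date)

    def run(self):
        # run() is only scheduled once every requirement is complete,
        # so the batch copy happens exactly once, at the end.
        logger.info('transferring into current directory')
        self.client.copy('current-old', 'current')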
I have a Python class that inherits from Popen:
from subprocess import Popen

class S(Popen):
    def exit(self):
        self.stdin.close()
        return self.wait()
This works fine, except that when I call the exit() method in my Python unit test (using the built-in unittest framework), the following warning comes up when running the test:
/usr/lib/python3.5/unittest/case.py:600: ResourceWarning: unclosed file <_io.TextIOWrapper name=5 encoding='UTF-8'>
  testMethod()
Here's the test code:
class TestS(unittest.TestCase):
    def test_exit(self):
        s = S()
        self.assertTrue(s.exit() == 0)
I know it's triggered by the return self.wait() line, because no other files are being opened, and if it's replaced by return 0 the warning goes away.
Is there something else that needs to be done for proper clean-up? Perhaps something equivalent to pclose() in C? I found a similar question, but it doesn't really solve this issue. The test passes, but I'd rather not suppress the warning without understanding the cause.
Some things I already tried, with no success:
Wrapping the call in a with S() as s block
Same as above, with self.exit() called from the context manager's __exit__ method
Thanks in advance!
I believe the warning refers to the stdout/stderr pipes of the subprocess, particularly if you were using subprocess.PIPE for either of them.
I had the same issue myself, and it went away after adding calls to proc.stdout.close() and proc.stderr.close() after wait() returns.
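For reference, a rough sketch of what that looks like on the class from the question, assuming S is constructed with PIPE for the standard streams:

from subprocess import Popen, PIPE

class S(Popen):
    def exit(self):
        self.stdin.close()
        rc = self.wait()
        # wait() reaps the child but does not close our ends of the
        # stdout/stderr pipes, which is what triggers the ResourceWarning.
        if self.stdout:
            self.stdout.close()
        if self.stderr:
            self.stderr.close()
        return rc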
I've acquired some code that I need to test before refactoring. It uses deep recursion, so it raises the limits and then runs itself in a fresh thread:
sys.setrecursionlimit(10**6)
threading.stack_size(2**27)
...
threading.Thread(target=main).start()
The code relies heavily on sys.stdin and sys.stdout e.g.
class SpamClass:
    def read(self):
        self.n = int(sys.stdin.readline())
        ...
        for i in range(self.n):
            [a, b, c] = map(int, sys.stdin.readline().split())
            ...

    def write(self):
        print(" ".join(str(x) for x in spam()))
To test the code, I need to pass in the contents of a series of input files and compare the results to the contents of some corresponding sample output files.
So far, I've tried three or four different types of mocking and patching without success. My other tests are all written for pytest, so it would be a real nuisance to have to use something else.
I've tried patching module.sys.stdin with StringIO, which doesn't seem to work because pytest's capsys sets sys.stdin to null and hence throws an error despite the patch.
I've also tried using pytest's monkeypatch fixture to replace the module.SpamClass.read method with a function defined in the test, but that produces a segmentation error due, I think, to the thread exiting before the test (or …?).
'pytest test_spam.py' terminated by signal SIGBUS (Misaligned address error)
Any suggestions for how to do this right? Many thanks.
Well, I still don't know what the problem was or if I'm doing this right, but it works for now. I'm not confident the threading aspect is working correctly, but the rest seems fine.
@pytest.mark.parametrize("inputs, outputs", helpers.get_sample_tests('spampath'))
def test_tree_orders(capsys, inputs, outputs):
    with patch('module.sys.stdin', StringIO("".join(inputs))):
        # Pass the callable itself (module.main, not module.main()),
        # and join so output is only captured after the thread finishes.
        thread = module.threading.Thread(target=module.main)
        thread.start()
        thread.join()
    captured = capsys.readouterr()
    assert "".join(outputs) == captured.out
For anyone else who's interested, it helps to do your debugging prints as print(spam, file=sys.stderr), which you can then access in the test as captured.err, cf. the captured.out used for testing.
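To illustrate the split between the two streams, a minimal self-contained sketch (not from the original suite; the prints stand in for the real code under test):

import sys

def test_debug_output(capsys):
    print("result")                       # what the code under test emits
    print("debug info", file=sys.stderr)  # diagnostic print
    captured = capsys.readouterr()
    assert captured.out == "result\n"
    assert captured.err == "debug info\n"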
I'm using Twisted along with Txmongo lib.
In the following function, I want to invoke cancelTest() 5 secs later. But the code does not work. How can I make it work?
from twisted.internet import task

def diverge(self, d):
    if d == 'Wait':
        self.flag = 1
        # self.timeInit = time.time()
        clock = task.Clock()
        for ip in self.ips:
            if self.factory.dictQueue.get(ip) is not None:
                self.factory.dictQueue[ip].append(self)
            else:
                self.factory.dictQueue[ip] = deque([self])
                # self.factory.dictQueue[ip].append(self)
        log.msg("-----------------the queue after wait")
        log.msg(self.factory.dictQueue)
        ############################### HERE, this does not work
        self.dtime = task.deferLater(clock, 5, self.printData)
        ###############################
        self.dtime.addCallback(self.cancelTest)
        self.dtime.addErrback(log.err)
    else:
        self.cancelTimeOut()
        d.addCallback(self.dispatch)
        d.addErrback(log.err)

def sendBackIP(self):
    self.ips.pop(0)
    log.msg("the IPs: %s" % self.ips)
    d = self.factory.service.checkResource(self.ips)
    d.addCallback(self.diverge)  # invoke the diverge function above
    log.msg("the result from checkResource: ")
    log.msg()
In general reactor.callLater() is the function you want. So if the function needs to be called 5 seconds later, your code would look like this:
from twisted.internet import reactor
reactor.callLater(5, cancelTest)
One thing that is strange is that your task.deferLater implementation should also work. However, without seeing more of your code, I don't think I can help you further, other than stating that it's strange :)
References
https://twistedmatrix.com/documents/current/core/howto/defer.html#callbacks
http://twistedmatrix.com/documents/current/api/twisted.internet.base.ReactorBase.html#callLater
You're doing almost everything right; you just didn't get the Clock part right.
twisted.internet.task.Clock is a deterministic implementation of IReactorTime, mostly used in unit/integration testing to get deterministic output from your code; you shouldn't use it in production.
So, what should you use in production? The reactor! In fact, all production reactor implementations implement the IReactorTime interface.
Just use the following import and function call:
from twisted.internet import reactor
# (snip)
self.dtime = task.deferLater(reactor, 5, self.printData)
Some side notes:
In your text above the snippet you say that you want to invoke cancelTest after five seconds, but in the code you actually schedule printData. Of course, if printData just prints something, doesn't raise, and returns an immediate value, cancelTest will execute immediately afterwards, since it's a chained callback; but if you want to be 100% sure, you should schedule cancelTest in deferLater, not printData.
Also, I don't understand whether this is meant as a kind of timeout; be advised that the callback will be triggered in all situations, even if the tests take less than five seconds. If you need a cancelable task, use reactor.callLater directly; that will NOT return a deferred you can chain on, but it will let you cancel the scheduled call.
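A short sketch of the cancelable variant, using cancelTest from the question (reactor.callLater returns a handle with cancel() and active()):

from twisted.internet import reactor

# Schedule cancelTest to fire in five seconds.
delayed = reactor.callLater(5, cancelTest)

# ...later, if the test finishes early, cancel the pending call:
if delayed.active():
    delayed.cancel()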
When I use the hide("everything") context manager, and a fabric task fails, I still get a message. The docs read:
everything: Includes warnings, running, user and output (see above.) Thus, when turning off everything, you will only see a bare minimum of output (just status and debug if it’s on), along with your own print statements.
But this is not strictly true, right? I still see status, debug, and abort messages.
If I really do want to hide everything, is there a better way than:
with hide("aborts"), hide("everything"):
...
When in doubt, look at the source:
https://github.com/fabric/fabric/blob/master/fabric/context_managers.py#L98
Here is the actual declaration; everything really is almost everything: warnings, running, user, output, exceptions.
https://github.com/fabric/fabric/blob/master/fabric/state.py#L411
It's just a nice wrapper around output. Frankly, I would stick to their built-in context managers, since those are less likely to change, plus you get the added value of more Pythonic, readable code:
@task
def task1():
    with hide('running', 'stdout', 'stderr'):
        run('ls /var/www')
        ....
vs.
@task
def task1():
    output['running'] = False
    output['stdout'] = False
    output['stderr'] = False
    # or just output['everything'] = False
    run('ls /var/www')
    ....
BUT, at the end of the day, it's the same thing.
This is what I have always used:
from fabric.state import output
output['everything'] = False
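If the goal is true total silence, a sketch combining the two ideas from this thread; since aborts is its own key, it is set explicitly here rather than relying on 'everything' covering it (the fabfile contents are made up):

from fabric.api import task, run
from fabric.state import output

output['everything'] = False
output['aborts'] = False  # aborts is a separate key, per the docs quoted above

@task
def task1():
    run('ls /var/www')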
I am trying to quit a Python program by calling sys.exit(), but it does not seem to be working.
The program structure is something like:
def func2():
    # does some scraping operations using scrapy
    ...

def func1():
    Request(urls, callback=func2)
So here, func1 requests a list of URLs, and the callback method func2 is called for each response. I want to quit execution of the program if something goes wrong in func2.
On checking the type of the object in func1, I found it's an http.Request object.
Also, since I am using Scrapy, whenever I call sys.exit() in func2, the next URL in the list is requested and program execution continues.
I have also tried to use a global variable to stop the execution but to no avail.
Where am I going wrong?
According to How can I instruct a spider to stop itself? in the Scrapy FAQ, you need to raise the CloseSpider exception:

from scrapy.exceptions import CloseSpider

raise CloseSpider('Done web-scraping for now')
Also see:
Running Scrapy tasks in Python
sys.exit() would not work here, since Scrapy is based on Twisted.
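To make that concrete, a minimal sketch of where the raise goes in practice; the spider name, URL, and failure check are all made up:

import scrapy
from scrapy.exceptions import CloseSpider

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        if 'error' in response.text:  # stand-in for "something went wrong"
            # Stops scheduling new requests; requests already in flight
            # may still be processed before the spider shuts down.
            raise CloseSpider('Done web-scraping for now')
        # ...normal parsing continues here...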
Even if we don't know how to stop Scrapy completely, Python's mutable-default-argument "gotcha" can help us skip all callbacks from a certain point on.
Here is what you can do:
First, create a factory that wraps other callback functions with a condition. Its second argument, cont, is bound to a mutable object (a list), so we can affect all generated callbacks even after creating them.
def callback_gen(f, cont=[True]):
    def c(response):
        if cont[0]:
            f(response, cont=cont)
        else:
            print("skipping")  # possibly replace with pass
    return c
Now make some testing functions:
def func2(response, cont=None):
    print(response)
    print(cont)
    # this should prevent any following callback from running
    cont[0] = False

def func3(response, cont=None):
    print(response)
    print(cont)
And now create two callbacks; the first one, func2, prevents the following ones from running:
f2 = callback_gen(func2)
f3 = callback_gen(func3)
f2("func2")
f3("func3")
I like it :)