File atomicity with luigi python library - python

Do I need to worry about file atomicity in luigi with the following code, pickling a dataframe and returning it as an output from a task? I don't get the atomicity part, as I would hope luigi would just wait for the task to complete writing a file before stating the task is complete.
class readSQLtoPickle(luigi.Task):
    sql = luigi.Parameter()
    pickle = luigi.Parameter()
    def output(self):
        return luigi.LocalTarget(self.pickle, format=format.Nop)
    def run(self):
        data = pd.read_sql(self.sql, ariel)
        with self.output().open('w') as f:
            pickle.dump(data, f)
class grabData(luigi.Task):  # standard Luigi Task class
    sql = luigi.Parameter(default="SELECT * FROM DIM_DRUG_PRODUCT")
    pickle = luigi.Parameter(default="drug_product.pkl")
    def requires(self):
        # we need to read the log file before we can process it
        return readSQLtoPickle(sql=self.sql, pickle=self.pickle)
    def run(self):
        with self.input().open('r') as f:
            df = pickle.load(f)
        print(type(df))
        print(df.head(100))
        print(len(df))

Writing to a LocalTarget is atomic. Behind the scenes, luigi first writes to a temp file and then moves the temp file to your actual target. Look for atomic_file in the source code.
I don't get the atomicity part, as I would hope luigi would just wait for the task to complete writing a file before stating the task is complete.
If you use a local scheduler to run your task (--local-scheduler) and have only one worker, then you should be fine.
It becomes a problem if you have several workers working on the same tasks and trying to identify which tasks are now available to run.
That is the situation atomic writes protect against: without them, one worker could check whether grabData is ready to run and see that the file exists while another worker is still in the middle of writing it in readSQLtoPickle.
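The write-to-a-temp-file-then-move idea can be sketched with the standard library alone. This is not luigi's actual implementation, only a minimal illustration of the same pattern; the helper name atomic_pickle_dump is made up for this example.

```python
import os
import pickle
import tempfile


def atomic_pickle_dump(obj, final_path):
    """Pickle obj to final_path so readers never see a half-written file.

    Hypothetical helper, not part of luigi.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file next to the target so the final rename stays
    # on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        # The target path only appears once the data is fully written.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial temp file and re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A process that checks whether final_path exists either finds nothing or finds the complete file, which is the property a scheduler needs before starting a downstream task.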

Related

Is it possible to write a luigi wrapper task that tolerates failed sub-tasks?

I have a luigi task that performs some non-stable computations. Think of an optimization process that sometimes does not converge.
import luigi
class MyOptimizer(luigi.Task):
    input_param = luigi.Parameter()
    output_filename = luigi.Parameter(default='result.json')
    def run(self):
        optimize_something(self.input_param, self.output().path)
    def output(self):
        return luigi.LocalTarget(self.output_filename)
Now I would like to build a wrapper task that will run this optimizer several times, with different input parameters, and will take the output of the first run that converged.
The way I am implementing it now is by not using MyOptimizer, because if it fails, luigi will consider the wrapper task failed as well, but I am okay with some instances of MyOptimizer failing.
class MyWrapper(luigi.Task):
    input_params_list = luigi.ListParameter()
    output_filename = luigi.Parameter(default='result.json')
    def run(self):
        for input_param in self.input_params_list:
            try:
                optimize_something(input_param, self.output().path)
                print(f"Optimizer succeeded with input {input_param}")
                break
            except Exception as e:
                print(f"Optimizer failed with input {input_param}. Trying again...")
    def output(self):
        return luigi.LocalTarget(self.output_filename)
The problem is that this way, the tasks are not parallelized. Also, you can imagine that MyOptimizer and optimize_something are complex tasks that also participate in the data pipeline handled by luigi, which creates quite a bit of chaos in my code.
I would appreciate any insights and ideas on how to make this work in a luigi-like fashion :)
Can you make it so that your optimizer always writes something out, even if it's an empty file that signifies failure but looks successful to luigi? Also, use the input_param in MyOptimizer's output filename to make the filenames unique.
Then:
class MyWrapper(luigi.Task):
    input_params_list = luigi.ListParameter()
    output_filename = luigi.Parameter(default='result.json')
    def run(self):
        task_list = [MyOptimizer(input_param) for input_param in self.input_params_list]
        targets = yield task_list  # executes tasks in parallel
        for target in targets:
            # ...do something to read and compare outputs
            some_data = some_read(target.path)
        # ...write optimal solution
    def output(self):
        return luigi.LocalTarget(self.output_filename)
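The "always write something out" part can be sketched without luigi. This is a minimal illustration that assumes a JSON record with a converged flag instead of an empty file; run_and_record, first_converged and the optimize callable are hypothetical names, not part of the question's code.

```python
import json
import os


def run_and_record(optimize, input_param, out_dir):
    """Run one optimization attempt and always write a result file.

    `optimize` is a hypothetical callable that returns a result, or raises
    when the optimization does not converge.
    """
    # Put the input parameter in the filename so every attempt is unique.
    path = os.path.join(out_dir, "result_{}.json".format(input_param))
    try:
        record = {"converged": True, "result": optimize(input_param)}
    except Exception as exc:
        # Failure is recorded as data instead of being raised.
        record = {"converged": False, "error": str(exc)}
    with open(path, "w") as f:
        json.dump(record, f)
    return path


def first_converged(paths):
    """Return the first record that converged, or None if none did."""
    for path in paths:
        with open(path) as f:
            record = json.load(f)
        if record["converged"]:
            return record
    return None
```

Inside a task like MyOptimizer, the body of run() would do what run_and_record does, and the wrapper would apply something like first_converged to the paths of the yielded targets.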

Global file initialization in flask- python

I have been googling to find out how to create a global file that stays open until my application has finished. I need to write the output of all modules in a view to a single file, so that users can download this file as a report from the front end once the application has finished running. This is the class I have created:
import time
class FileOperations:
    def __init__(self):
        self.current_time = time.strftime('%Y-%m-%d_%H-%M-%S')
        self.outfile = open("reports/username_" + self.current_time + ".txt", 'w')
        self.outfile.write("Final Report \n")
        self.outfile.write("*****************")
I want this file to be generated when the application starts running, and it should be available to all modules.
A context manager is a way to safely handle operations such as writing to a file. It also makes it easier to trace when the file is opened and closed.
I suggest you record the time when the application starts and reuse the same file name, as I take it you intended. That's probably "safer" than keeping the file open.
import time
def get_time():
    global start_time
    start_time = time.strftime('%Y-%m-%d_%H-%M-%S')
def write_to_file():
    with open('reports/username_{}.txt'.format(start_time), 'a') as f:
        f.write("Final Report \n")
        f.write("*****************")
if 'start_time' not in globals():
    get_time()
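The conditional runs whenever the module body is executed, which normally happens once per process (on the first import) and again if the module is reloaded. By checking whether start_time is already defined in the module scope, we make sure to define it only once.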
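A slightly fuller sketch of the same idea follows: the timestamp is fixed at import time, and every module appends through one helper that opens and closes the file on each call. The names START_TIME, report_path and append_to_report, and the report_dir argument, are assumptions made for this example.

```python
import os
import time

# Computed once, when this module is first imported.
START_TIME = time.strftime('%Y-%m-%d_%H-%M-%S')


def report_path(report_dir, username="username"):
    """Build the report file name for this application run."""
    return os.path.join(report_dir, "{}_{}.txt".format(username, START_TIME))


def append_to_report(report_dir, text, username="username"):
    """Append one line to the report, writing the header on first use."""
    path = report_path(report_dir, username)
    is_new = not os.path.exists(path)
    # The context manager closes the file even if a write fails.
    with open(path, 'a') as f:
        if is_new:
            f.write("Final Report\n*****************\n")
        f.write(text + "\n")
    return path
```

Any module can then import the helper and call append_to_report("reports", "module A done"); the header is written only by the first caller.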

Multiprocessing Pipe() not working

I am trying to get familiar with the multiprocessing module. I am currently having some issues with Pipe(). I devised a small example to illustrate my problem.
I wrote two functions:
One that creates files in a specific folder (spawner)
And another that detects these files and copies them to another folder (cleaner)
They both work fine. I also managed to create a Process for both so that the creation and copying of the files happens simultaneously.
For the next step, I want the spawner to communicate to the cleaner that it has finished creating files so that the latter can terminate.
Here is the code:
import os
from time import sleep
import multiprocessing as mp
from shutil import copy2
def spawner(f_folder, pipeEnd):
    template = 'my_file{}.txt'
    for i in range(10):
        new_file = os.path.join(f_folder, template.format(str(i)))
        with open(new_file, 'w'):
            pass
        sleep(1)
    pipeEnd.send(True)
    return
def cleaner(f_folder, t_folder, pipeEnd):
    state = set()
    while not pipeEnd.recv():
        new_files = set(os.listdir(f_folder)).difference(state)
        state = set(os.listdir(f_folder))
        for file in new_files:
            copy2(os.path.join(f_folder, file), t_folder)
        sleep(3)
    return
if __name__ == '__main__':
    receiver, sender = mp.Pipe()
    from_folder = r'C:\Users\evkouni\Desktop\TEMP\PythonTests\subProcess\from'
    to_folder = r'C:\Users\evkouni\Desktop\TEMP\PythonTests\subProcess\to'
    p = mp.Process(target=spawner, args=(from_folder, sender))
    q = mp.Process(target=cleaner, args=(from_folder, to_folder, receiver))
    p.start()
    q.start()
I just cannot seem to get it to work. Any help would be appreciated.
A Pipe is the wrong solution to your problem. recv() blocks until there is something to receive, so the cleaner's loop condition waits for the spawner's single message instead of polling. You could use a pipe if you wanted to pass the file names from the spawner to the cleaner, but what you are trying to do is raise a flag. For that purpose, I would recommend the use of an Event: https://docs.python.org/2/library/multiprocessing.html#multiprocessing.Event
This can be considered a thread-safe (and multiprocess-safe) boolean. You would use it like
finished = mp.Event()
...
finished.set() # pipeEnd.send(True)
...
while not finished.is_set(): # while not receiver.recv():
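Put together with the functions from the question, the Event version could look roughly like this. It is a sketch rather than a drop-in replacement: the count, delay and poll parameters, the copy_new_files helper and run_both are additions made for this example.

```python
import multiprocessing as mp
import os
from shutil import copy2
from time import sleep


def spawner(f_folder, finished, count=10, delay=1.0):
    """Create `count` empty files, then raise the flag."""
    for i in range(count):
        with open(os.path.join(f_folder, 'my_file{}.txt'.format(i)), 'w'):
            pass
        sleep(delay)
    finished.set()


def copy_new_files(f_folder, t_folder, seen):
    """Copy files not seen before and return the updated set of names."""
    current = set(os.listdir(f_folder))
    for name in current - seen:
        copy2(os.path.join(f_folder, name), t_folder)
    return current


def cleaner(f_folder, t_folder, finished, poll=3.0):
    """Copy new files until the flag is set, then do one last sweep."""
    seen = set()
    while not finished.is_set():
        seen = copy_new_files(f_folder, t_folder, seen)
        # wait() returns early as soon as the flag is set.
        finished.wait(poll)
    # Pick up anything created between the last sweep and the flag.
    copy_new_files(f_folder, t_folder, seen)


def run_both(from_folder, to_folder):
    """Start both processes and wait for them to finish."""
    finished = mp.Event()
    p = mp.Process(target=spawner, args=(from_folder, finished))
    q = mp.Process(target=cleaner, args=(from_folder, to_folder, finished))
    p.start()
    q.start()
    p.join()
    q.join()
```

As in the question, run_both(from_folder, to_folder) should be called under an if __name__ == '__main__': guard. The final sweep after the loop matters because the spawner may create its last file after the cleaner's last poll.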

Django server crashes with exit codes 139, 77

Foreword
Okay, I have a really complex performance issue. I'm building a content management system, and one of its features is generating tons of .docx files from different templates. I started with Webodt + Abiword, but then the templates got too complex, so I had to switch my backend to Templated-docs + LibreOffice. This is where my problems started.
I use:
Python 2.7.12
Django==1.8.2
templated-docs==0.2.9
LibreOffice 5.1.5.2
Ubuntu 16.04
The actual problem
I have an API which handles .docx rendering. I will show one of the views as an example; they are all pretty similar:
@permission_classes((permissions.IsAdminUser,))
class BookDocxViewSet(mixins.RetrieveModelMixin, viewsets.GenericViewSet):
    def retrieve(self, request, *args, **kwargs):
        queryset = Pupils.objects.get(id=kwargs['pk'])
        serializer = StudentSerializer(queryset)
        context = dict(serializer.data)
        doc = fill_template('crm/docs/book.ott', context, output_format='docx')
        p = u'docs/books/%s/%s_%s_%s.doc' % (datetime.now().date(), context[u'surname'], context[u'name'], datetime.now().date())
        # The with block closes the file, so no explicit close() is needed.
        with open(doc, 'rb') as f:
            content = f.read()
        path = default_storage.save(p, ContentFile(content))
        return response.Response(u'/media/' + path)
When I call it the first time, it creates a .docx file, saves it to my default_storage and then returns a download link. But when I try to do it again, or do it with another method (which works with another template and context), my server just crashes without any logs. The last thing I see is either:
Process finished with exit code 77, if I call it after a short delay (more than one second)
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV), if I call my method the second time right away (in less than one second)
I tried to use a debugger; it said that my server crashes on this line:
doc = fill_template('crm/docs/book.ott', context, output_format='docx')
I bet what happens is:
When I call my method the first time templated_docs starts LibreOffice backend, and then does not stop it
When I call my method the second time templated_docs tries to start LibreOffice backend again, but it is already busy.
Questions
How do I debug LibreOffice to prove / refute my theory? (I guess I need to debug templated_docs instead)
Why do I get different exit codes depending on the delay?
Is this enough of a basis to open an issue on GitHub?
How do I fix that?
UPD
It is not an issue with REST Framework or with not using FileResponse().
I already tried to test it with a regular view.
def get_document(request, *args, **kwargs):
    context = Pupils.objects.get(id=kwargs['pk']).__dict__
    doc = fill_template('crm/docs/book.ott', context, output_format='docx')
    p = u'%s_%s_%s' % (context[u'surname'], context[u'name'], datetime.now().date())
    return FileResponse(doc, p)
And the problem is the same.
UPD 2
Okay. This line is crashing my server:
# pylokit/lokit.py
self.lokit = lo.libreofficekit_hook(six.b(lo_path))
Okay, that was a bug in templated_docs. I was right: it happens because templated_docs tries to start LibreOffice twice. As the pylokit documentation says:
The use of _exit() instead of default exit() is required because in
some circumstances LibreOffice segfaults on process exit.
It means the process that used pylokit should be killed afterwards. But we cannot kill the Django server, so I decided to use multiprocessing:
# templated_docs/__init__.py
if source_extension[1:] != output_format:
    lo_path = getattr(
        settings,
        'TEMPLATED_DOCS_LIBREOFFICE_PATH',
        '/usr/lib/libreoffice/program/')
    def f(conn):
        with Office(lo_path) as lo:
            conv_file = NamedTemporaryFile(delete=False,
                                           suffix='.%s' % output_format)
            with lo.documentLoad(str(dest_file.name)) as doc:
                doc.saveAs(conv_file.name)
            os.unlink(dest_file.name)
        conn.send(conv_file.name)
        conn.close()
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    conv_file_name = parent_conn.recv()
    p.join()
    return conv_file_name
else:
    return dest_file.name
I opened an issue and made a pull request.
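The underlying idea, running crash-prone native code in a throwaway process so the server survives, can also be shown with the subprocess module instead of multiprocessing. The sketch below is a variation on the fix above, not the templated_docs patch itself; it assumes Python 3, and run_isolated is a hypothetical helper name.

```python
import subprocess
import sys


def run_isolated(code, timeout=60):
    """Run a snippet of Python in a fresh interpreter.

    Returns (exit_code, stdout). A crash or hard exit in the child ends
    only the child; the calling process keeps running.
    """
    completed = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return completed.returncode, completed.stdout.strip()
```

The child could print the path of the converted file and the parent would read it from stdout. On the exit codes: 139 is the common shell convention of 128 plus the signal number (11, SIGSEGV), whereas 77 is an ordinary exit status returned by the process itself. On POSIX, subprocess reports a child killed by a signal as a negative return code instead.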

Python multiprocessing pool hangs on map call

I have a function that parses a file and inserts the data into MySQL using SQLAlchemy. I've been running the function sequentially on the result of os.listdir() and everything works perfectly.
Because most of the time is spent reading the file and writing to the DB, I wanted to use multiprocessing to speed things up. Here is my pseudocode, as the actual code is too long:
def parse_file(filename):
    f = open(filename, 'rb')
    data = f.read()
    f.close()
    soup = BeautifulSoup(data, features="lxml", from_encoding='utf-8')
    # parse file here
    db_record = MyDBRecord(parsed_data)
    session.add(db_record)
    session.commit()
pool = mp.Pool(processes=8)
pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])
The problem I'm seeing is that the script hangs and never finishes. I usually get 48 of 63 records into the database. Sometimes it's more, sometimes it's less.
I've tried using pool.close(), both on its own and in combination with pool.join(), and neither seems to help.
How do I get this script to complete? What am I doing wrong? I'm using Python 2.7.8 on a Linux box.
You need to put all code that uses multiprocessing inside its own function. This stops it from recursively launching new pools when multiprocessing re-imports your module in separate processes:
def parse_file(filename):
    ...
def main():
    pool = mp.Pool(processes=8)
    pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])
if __name__ == '__main__':
    main()
See the documentation about making sure your module is importable, and also the advice for running on Windows(tm).
The problem was a combination of 2 things:
my pool code being called multiple times (thanks @Peter Wood)
my DB code opening too many sessions (and/or) sharing sessions
I made the following changes and everything works now:
Original File
def parse_file(filename):
    f = open(filename, 'rb')
    data = f.read()
    f.close()
    soup = BeautifulSoup(data, features="lxml", from_encoding='utf-8')
    # parse file here
    db_record = MyDBRecord(parsed_data)
    session = get_session()  # see below
    session.add(db_record)
    session.commit()
if __name__ == '__main__':
    pool = mp.Pool(processes=8)
    pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])
DB File
def get_session():
    engine = create_engine('mysql://root:root@localhost/my_db')
    Base.metadata.create_all(engine)
    Base.metadata.bind = engine
    db_session = sessionmaker(bind=engine)
    return db_session()
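The same two fixes, the pool created inside a function and a database connection owned by each call instead of one shared across processes, can be sketched in a self-contained way. Here sqlite3 stands in for MySQL/SQLAlchemy so that the example needs no server, and storing the file size stands in for the real parsing; the table name, the job tuple and main are assumptions made for illustration.

```python
import multiprocessing as mp
import os
import sqlite3


def parse_file(job):
    """Store one file's size, using a connection owned by this call."""
    db_path, filename = job
    with open(filename, 'rb') as f:
        data = f.read()
    # Open the connection inside the worker; never share one across processes.
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS records (name TEXT, size INTEGER)")
            conn.execute("INSERT INTO records VALUES (?, ?)",
                         (os.path.basename(filename), len(data)))
    finally:
        conn.close()
    return filename


def main(db_path, directory):
    """Build the job list and run it on a pool created inside a function."""
    jobs = [(db_path, os.path.join(directory, name))
            for name in sorted(os.listdir(directory))]
    pool = mp.Pool(processes=4)
    try:
        return pool.map(parse_file, jobs)
    finally:
        pool.close()
        pool.join()
```

main(db_path, directory) is meant to be called under an if __name__ == '__main__': guard, with the database file kept outside the directory being parsed. With SQLAlchemy, the equivalent step is creating the engine and session inside the worker, as get_session() does above.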
