Foreword
Okay, I have a really complex performance issue. I'm building a content management system, and one of its features is generating lots of .docx files from different templates. I started with Webodt + AbiWord, but then the templates got too complex, so I had to switch my backend to templated-docs + LibreOffice. This is where my problems started.
I use:
Python 2.7.12
Django==1.8.2
templated-docs==0.2.9
LibreOffice 5.1.5.2
Ubuntu 16.04
The actual problem
I have an API which handles .docx rendering. I will show one of the views as an example; they are all pretty similar:
@permission_classes((permissions.IsAdminUser,))
class BookDocxViewSet(mixins.RetrieveModelMixin, viewsets.GenericViewSet):
    def retrieve(self, request, *args, **kwargs):
        queryset = Pupils.objects.get(id=kwargs['pk'])
        serializer = StudentSerializer(queryset)
        context = dict(serializer.data)
        # Render the .ott template into a .docx file on disk
        doc = fill_template('crm/docs/book.ott', context, output_format='docx')
        p = u'docs/books/%s/%s_%s_%s.docx' % (datetime.now().date(), context[u'surname'], context[u'name'], datetime.now().date())
        # Copy the rendered file into default_storage and return its URL
        with open(doc, 'rb') as f:
            content = f.read()
        path = default_storage.save(p, ContentFile(content))
        return response.Response(u'/media/' + path)
When I call it the first time, it creates a .docx file, saves it to my default_storage and then returns a download link. But when I try to do it again, or do it with another method (which works with another template and context), my server just crashes without any logs. The last thing I see is either:
Process finished with exit code 77 if I call it with a little delay (more than one second), or
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV) if I call my method the second time right away (in less than one second).
I tried to use a debugger -- it said that my server crashes on this line:
doc = fill_template('crm/docs/book.ott', context, output_format='docx')
I bet what happens is:
When I call my method the first time, templated_docs starts the LibreOffice backend and then does not stop it.
When I call my method the second time, templated_docs tries to start the LibreOffice backend again, but it is already busy.
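If that theory holds, it should be reproducible without Django at all. Here is a minimal repro sketch (my own, assuming pylokit is installed and the LibreOffice path matches yours):

# Repro sketch: initialize LibreOfficeKit twice in one process
from pylokit import Office

lo_path = '/usr/lib/libreoffice/program/'
with Office(lo_path) as lo:  # first initialization: fine
    pass
with Office(lo_path) as lo:  # second initialization: segfault expected if the theory holds
    pass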
Questions
How do I debug LibreOffice to prove or refute my theory? (I guess I need to debug templated_docs instead.)
Why do I get different exit codes depending on the delay?
Is this enough grounds to open an issue on GitHub?
How do I fix that?
UPD
It is not an issue with REST Framework or with not using FileResponse().
I already tried to test it with regular view.
def get_document(request, *args, **kwargs):
    context = Pupils.objects.get(id=kwargs['pk']).__dict__
    doc = fill_template('crm/docs/book.ott', context, output_format='docx')
    p = u'%s_%s_%s' % (context[u'surname'], context[u'name'], datetime.now().date())
    return FileResponse(doc, p)
And the problem is same.
UPD 2
Okay. This line is crashing my server:
# pylokit/lokit.py
self.lokit = lo.libreofficekit_hook(six.b(lo_path))
Okay, that was a bug in templated_docs. I was right: it happens because templated_docs tries to start LibreOffice twice. As the pylokit documentation says:
The use of _exit() instead of default exit() is required because in
some circumstances LibreOffice segfaults on process exit.
It means the process that used pylokit should be killed afterwards. But we cannot kill the Django server, so I decided to use multiprocessing:
# templated_docs/__init__.py
# (needs: import os; from multiprocessing import Process, Pipe;
#  from tempfile import NamedTemporaryFile; from pylokit import Office)
if source_extension[1:] != output_format:
    lo_path = getattr(
        settings,
        'TEMPLATED_DOCS_LIBREOFFICE_PATH',
        '/usr/lib/libreoffice/program/')

    def f(conn):
        # Run the conversion in a child process so that LibreOffice's
        # segfault-on-exit cannot take the Django server down with it.
        with Office(lo_path) as lo:
            conv_file = NamedTemporaryFile(delete=False,
                                           suffix='.%s' % output_format)
            with lo.documentLoad(str(dest_file.name)) as doc:
                doc.saveAs(conv_file.name)
            os.unlink(dest_file.name)
            conn.send(conv_file.name)
        conn.close()

    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    conv_file_name = parent_conn.recv()
    p.join()
    return conv_file_name
else:
    return dest_file.name
I opened an issue and made a pull request.
Related
I'm working on a Python program supposed to read incoming MS-Word documents in a client/server fashion, i.e. the client sends a request (one or multiple MS-Word documents) and the server reads specific content from those requests using pythoncom and win32com.
Because I want to minimize the waiting time for the client (the client needs a status message from the server), I do not want to open an MS-Word instance for every request. Hence, I intend to have a pool of running MS-Word instances from which the server can pick and choose. This, in turn, means I have to reuse those instances from the pool in different threads, and this is what causes trouble right now.
After I fixed the error I previously asked about on Stack Overflow, my code now looks like this:
import pythoncom, win32com.client, threading, psutil, os, queue, time, datetime

class WordInstance:
    def __init__(self, app):
        self.app = app
        self.flag = True

appPool = {'WINWORD.EXE': queue.Queue()}

def initAppPool():
    global appPool
    wordApp = win32com.client.DispatchEx('Word.Application')
    # Wrap the COM object so we can attach the busy flag.
    # For testing purposes I only use one MS-Word instance currently.
    appPool["WINWORD.EXE"].put(WordInstance(wordApp))

def run_in_thread(instance, appid, path):
    print(f"[{datetime.datetime.now()}] open doc ... {threading.current_thread().name}")
    pythoncom.CoInitialize()
    wordApp = win32com.client.Dispatch(pythoncom.CoGetInterfaceAndReleaseStream(appid, pythoncom.IID_IDispatch))
    doc = wordApp.Documents.Open(path)
    doc.SaveAs(rf'{path}.FB.pdf', FileFormat=17)
    doc.Close()
    print(f"[{datetime.datetime.now()}] close doc ... {threading.current_thread().name}")
    instance.flag = True

if __name__ == '__main__':
    initAppPool()
    pathOfFile2BeRead1 = r'C:\Temp\file4.docx'
    pathOfFile2BeRead2 = r'C:\Temp\file5.docx'

    # treat first request
    wordApp = appPool["WINWORD.EXE"].get(True, 10)
    wordApp.flag = False
    pythoncom.CoInitialize()
    wordApp_id = pythoncom.CoMarshalInterThreadInterfaceInStream(pythoncom.IID_IDispatch, wordApp.app)
    readDocjob1 = threading.Thread(target=run_in_thread, args=(wordApp, wordApp_id, pathOfFile2BeRead1), daemon=True)
    readDocjob1.start()
    appPool["WINWORD.EXE"].put(wordApp)

    # wait here until readDocjob1 is done
    wait = True
    while wait:
        try:
            wordApp = appPool["WINWORD.EXE"].get(True, 1)
            if wordApp.flag:
                print(f"[{datetime.datetime.now()}] ok appPool extracted")
                wait = False
            else:
                appPool["WINWORD.EXE"].put(wordApp)
        except queue.Empty:
            print(f"[{datetime.datetime.now()}] error: appPool empty")
        except BaseException as err:
            print(f"[{datetime.datetime.now()}] error: {err}")

    wordApp.flag = False
    openDocjob2 = threading.Thread(target=run_in_thread, args=(wordApp, wordApp_id, pathOfFile2BeRead2), daemon=True)
    openDocjob2.start()
When I run the script I receive the following output printed on the terminal:
[2022-03-29 11:41:08.217678] open doc ... Thread-1
[2022-03-29 11:41:10.085999] close doc ... Thread-1
[2022-03-29 11:41:10.085999] ok appPool extracted
[2022-03-29 11:41:10.085999] open doc ... Thread-2
Process finished with exit code 0
And only the first Word file is converted to a PDF. It seems like run_in_thread terminates after the print statement and before/during pythoncom.CoInitialize(). Sadly, I do not receive any error message, which makes it quite hard to understand the cause of this behavior.
After reading Microsoft's documentation, I tried using pythoncom.CoInitializeEx(pythoncom.COINIT_APARTMENTTHREADED) instead of pythoncom.CoInitialize(), since my COM object needs to be called by multiple threads. However, this changed nothing.
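Two things seem worth checking, though I cannot confirm them from the output alone: (a) a stream created by CoMarshalInterThreadInterfaceInStream can be unmarshalled exactly once, and CoGetInterfaceAndReleaseStream consumes it, so reusing wordApp_id for the second thread hands that thread a stream that no longer exists; (b) both workers are daemon threads and the main thread returns right after openDocjob2.start(), so the process can exit mid-conversion with code 0. A sketch of how I would address both, reusing the names from the script above (the helper function is hypothetical):

def start_conversion(instance, path):
    # Marshal a fresh stream for every worker thread; a marshalled
    # stream cannot be consumed twice.
    app_id = pythoncom.CoMarshalInterThreadInterfaceInStream(
        pythoncom.IID_IDispatch, instance.app)
    job = threading.Thread(target=run_in_thread,
                           args=(instance, app_id, path))  # non-daemon
    job.start()
    return job

# ... at the end of the main block:
job2 = start_conversion(wordApp, pathOfFile2BeRead2)
job2.join()  # keep the main thread alive until the worker finishes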
I've been writing Markdown files lately and have been using the awesome table-of-contents generator (github-markdown-toc) tool/script on a daily basis, but I'd like the TOC to be regenerated automatically each time I press Ctrl+S, right before the .md file is saved in my Sublime Text 3 environment.
What I have done till now is generate it manually from the shell, using:
gh-md-toc --insert my_file.md
So I wrote a simple plugin, but for some reason I can't see the result I wanted.
I see my print, but the TOC is not generated.
Does anybody have any suggestions? What's wrong?
import sublime, sublime_plugin
import subprocess

class AutoRunTOCOnSave(sublime_plugin.EventListener):
    """ A class to listen for events triggered by ST. """

    def on_post_save_async(self, view):
        """
        This is called after a view has been saved. It runs in a separate thread
        and does not block the application.
        """
        file_path = view.file_name()
        if not file_path:
            return
        NOT_FOUND = -1
        pos_dot = file_path.rfind(".")
        if pos_dot == NOT_FOUND:
            return
        file_extension = file_path[pos_dot:]
        if file_extension.lower() == ".md":
            print("Markdown TOC was invoked: handling with *.md file")
            subprocess.Popen(["gh-md-toc", "--insert ", file_path])
Here's a slightly modified version of your plugin:
import sublime
import sublime_plugin
import subprocess

class AutoRunTOCOnSaveListener(sublime_plugin.EventListener):
    """ A class to listen for events triggered by ST. """

    def on_post_save_async(self, view):
        """
        This is called after a view has been saved. It runs in a separate thread
        and does not block the application.
        """
        file_path = view.file_name()
        if not file_path:
            return
        if file_path.split(".")[-1].lower() == "md":
            print("Markdown TOC was invoked: handling with *.md file")
            subprocess.Popen(["/full/path/to/gh-md-toc", "--insert ", file_path])
I changed a couple of things, along with the name of the class. First, I simplified your test for determining whether the current file is a Markdown document (fewer operations means less room for error). Second, you should include the full path to the gh-md-toc command, as it's possible subprocess.Popen can't find it on the default path.
I figured it out: since gh-md-toc is a bash script, I replaced the following line:
subprocess.Popen(["gh-md-toc", "--insert ", file_path])
with:
subprocess.check_call("gh-md-toc --insert %s" % file_path, shell=True)
So now it works well, on each save.
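As a side note: in the list form, subprocess passes "--insert " (with the trailing space) to gh-md-toc as a literal argument, which may be why the Popen variant silently did nothing; the shell=True form re-tokenizes on whitespace, so the stray space is harmless there. If you keep shell=True, it is also worth quoting the path so file names with spaces survive. A sketch (Sublime Text 3 plugins run on Python 3.3+, which has shlex.quote):

import shlex
import subprocess

# Quote the path so spaces and shell metacharacters cannot break the command
subprocess.check_call("gh-md-toc --insert %s" % shlex.quote(file_path),
                      shell=True)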
When I use a standard Queue to send samples to a process, everything works fine. However, since my needs are simple, I tried to use a SimpleQueue, and for some reason the empty() method doesn't work. Here are the details:
The error comes from the consumer process (when sample_queue is a Queue everything works; when sample_queue is a SimpleQueue, things break):
def frame_update(i):
    while not self.sample_queue.empty():
        sample = self.sample_queue.get()
    for line in lines:
        ...
While executing sample_queue.empty() -- SimpleQueue.empty() from Python 3.6 on Windows (queues.py) -- we get:
def empty(self):
    return not self._poll()
where self._poll has been set in __init__ by:
def __init__(self, *, ctx):
    self._reader, self._writer = connection.Pipe(duplex=False)
    self._rlock = ctx.Lock()
    self._poll = self._reader.poll
    if sys.platform == 'win32':
        self._wlock = None
    else:
        self._wlock = ctx.Lock()
So I follow self._reader, which is set from connection.Pipe (connection.py):
...
c1 = PipeConnection(h1, writable=duplex)
c2 = PipeConnection(h2, readable=duplex)
OK, great. The _reader is going to be a PipeConnection, and PipeConnection has this method:
def _poll(self, timeout):
    if (self._got_empty_message or
            _winapi.PeekNamedPipe(self._handle)[0] != 0):
        return True
    return bool(wait([self], timeout))
Alright -- so a couple of questions:
1) Shouldn't the __init__ of SimpleQueue be assigning self._poll to self._reader._poll instead of self._reader.poll? Or am I missing something in the inheritance hierarchy?
2) The PipeConnection._poll routine takes a timeout parameter, so #1 shouldn't work...
*) Is there some other binding of PipeConnection._poll that I'm missing?
Am I missing something? I am using Python3.6, Windows, debugging in PyCharm and I follow all the paths and they're in the standard multiprocessing paths. I'd appreciate any help or advice. Thanks!
EDIT: After further review, I can see that PipeConnection is a subclass of _ConnectionBase, which does indeed have a 'poll' method, and it is bound with a default timeout parameter.
So the question is: When SimpleQueue is initializing and sets
self._poll = self._reader.poll
Why doesn't it go up the class hierarchy to grab that from _ConnectionBase?
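For anyone who wants to reproduce this in isolation, here is a minimal sketch of the symptom as I understand it (on Windows, where the consumer process receives its copy of the queue by pickling, so __setstate__ runs there):

# Repro sketch: calling SimpleQueue.empty() in a child process
from multiprocessing import Process, SimpleQueue

def consumer(q):
    # Expected to raise AttributeError ('SimpleQueue' object has no
    # attribute '_poll') on affected versions, since __setstate__
    # never restores self._poll
    print(q.empty())

if __name__ == '__main__':
    q = SimpleQueue()
    p = Process(target=consumer, args=(q,))
    p.start()
    p.join()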
After looking at why the Queue type works and the SimpleQueue doesn't, I found that Queue re-binds its _poll method 'after fork' as well as before; SimpleQueue doesn't. By changing the __setstate__ method to add self._poll = self._reader.poll as follows (queues.py, line 338), SimpleQueue works:
def __setstate__(self, state):
    (self._reader, self._writer, self._rlock, self._wlock) = state
    self._poll = self._reader.poll
Seems like a bug to me unless I'm really misunderstanding something. I'll submit a bug report and reference this post. Hope this helps someone!
http://bugs.python.org/issue30301
I want to use pyinotify to watch for changes on the filesystem. If a file has changed, I want to update my database accordingly (re-read tags, other information...).
I put the following code in my app's signals.py
import pyinotify
....
# create filesystem watcher in separate thread
wm = pyinotify.WatchManager()
notifier = pyinotify.ThreadedNotifier(wm, ProcessInotifyEvent())
# notifier.setDaemon(True)
notifier.start()
mask = pyinotify.IN_CLOSE_WRITE | pyinotify.IN_CREATE | pyinotify.IN_MOVED_TO | pyinotify.IN_MOVED_FROM
dbgprint("Adding path to WatchManager:", settings.MUSIC_PATH)
wdd = wm.add_watch(settings.MUSIC_PATH, mask, rec=True, auto_add=True)
def connect_all():
    """
    to be called from models.py
    """
    rescan_start.connect(rescan_start_callback)
    upload_done.connect(upload_done_callback)
    ....
This works great when Django is run with ./manage.py runserver. However, when run as ./manage.py runfcgi, Django won't start. There is no error message; it just hangs and won't daemonize, probably at the line notifier.start().
When I run ./manage.py runfcgi method=threaded and enable the line notifier.setDaemon(True), the notifier thread is stopped (isAlive() == False).
What is the correct way to start endless threads together with Django when Django is run as FCGI? Is it even possible?
Well, duh. Never start your own endless thread alongside Django. I use Celery, where running such threads works a bit better.
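For illustration, a minimal sketch of the Celery route (the task name and wiring are my own, ProcessInotifyEvent is the handler class from the question, and a configured Celery app is assumed):

from celery import shared_task
import pyinotify

@shared_task
def watch_music_path(path):
    # The blocking watch loop lives in a Celery worker process,
    # not in the Django/FCGI process
    wm = pyinotify.WatchManager()
    notifier = pyinotify.Notifier(wm, ProcessInotifyEvent())
    mask = (pyinotify.IN_CLOSE_WRITE | pyinotify.IN_CREATE |
            pyinotify.IN_MOVED_TO | pyinotify.IN_MOVED_FROM)
    wm.add_watch(path, mask, rec=True, auto_add=True)
    notifier.loop()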
I have a setup.py script which needs to probe the compiler for certain things, like support for TR1 or the presence of windows.h (to add the NOMINMAX define), etc. I do these checks by creating a simple program and trying to compile it with distutils' compiler class. The presence or lack of errors is my answer.
This works well, but it means that the compiler's ugly error messages get printed to the console. Is there a way to suppress error messages for when the compile function is called manually?
Here is my function which tries to compile the program, and which now DOES eliminate the error messages by redirecting the error stream to a file (I answered my own question):
def see_if_compiles(program, include_dirs, define_macros):
    """ Try to compile the passed in program and report if it compiles successfully or not. """
    from distutils.ccompiler import new_compiler, CompileError
    from shutil import rmtree
    import tempfile
    import os
    import sys

    try:
        tmpdir = tempfile.mkdtemp()
    except AttributeError:
        # Python 2.2 doesn't have mkdtemp().
        tmpdir = "compile_check_tempdir"
        try:
            os.mkdir(tmpdir)
        except OSError:
            print "Can't create temporary directory. Aborting."
            sys.exit()

    old = os.getcwd()
    os.chdir(tmpdir)

    # Write the program
    f = open('compiletest.cpp', 'w')
    f.write(program)
    f.close()

    # Redirect the error stream to a file to keep ugly compiler error
    # messages off the command line
    devnull = open('errors.txt', 'w')
    oldstderr = os.dup(sys.stderr.fileno())
    os.dup2(devnull.fileno(), sys.stderr.fileno())

    try:
        c = new_compiler()
        for macro in define_macros:
            c.define_macro(name=macro[0], value=macro[1])
        c.compile([f.name], include_dirs=include_dirs)
        success = True
    except CompileError:
        success = False

    # Undo the error stream redirect
    os.dup2(oldstderr, sys.stderr.fileno())
    devnull.close()

    os.chdir(old)
    rmtree(tmpdir)
    return success
Here is a function which uses the above to check for the presence of a header.
def check_for_header(header, include_dirs, define_macros):
    """Check for the existence of a header file by creating a small program which includes it and see if it compiles."""
    program = "#include <%s>\n" % header
    sys.stdout.write("Checking for <%s>... " % header)
    success = see_if_compiles(program, include_dirs, define_macros)
    if success:
        sys.stdout.write("OK\n")
    else:
        sys.stdout.write("Not found\n")
    return success
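For instance, the windows.h probe mentioned at the top might be wired up like this (a sketch; the surrounding variables are mine, not from the real setup.py):

# Probe for windows.h and add the NOMINMAX define only if it is present
include_dirs = []
define_macros = []
if check_for_header('windows.h', include_dirs, define_macros):
    define_macros.append(('NOMINMAX', '1'))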
Zac's comment spurred me to look a bit more, and I found that Mercurial's setup.py script has a working method for his approach. You can't just assign the stream, because the change won't get inherited by the compiler process, but apparently Python has our good friend dup2() in the form of os.dup2(). That allows the same OS-level stream shenanigans that we all know and love, and those do get inherited by child processes.
Mercurial's function redirects to /dev/null, but to keep Windows compatibility I just redirect to a file and then delete it.
Quoth Mercurial:
# simplified version of distutils.ccompiler.CCompiler.has_function
# that actually removes its temporary files.
# (needs: import os, sys, shutil, tempfile)
def hasfunction(cc, funcname):
    tmpdir = tempfile.mkdtemp(prefix='hg-install-')
    devnull = oldstderr = None
    try:
        try:
            fname = os.path.join(tmpdir, 'funcname.c')
            f = open(fname, 'w')
            f.write('int main(void) {\n')
            f.write(' %s();\n' % funcname)
            f.write('}\n')
            f.close()
            # Redirect stderr to /dev/null to hide any error messages
            # from the compiler.
            # This will have to be changed if we ever have to check
            # for a function on Windows.
            devnull = open('/dev/null', 'w')
            oldstderr = os.dup(sys.stderr.fileno())
            os.dup2(devnull.fileno(), sys.stderr.fileno())
            objects = cc.compile([fname], output_dir=tmpdir)
            cc.link_executable(objects, os.path.join(tmpdir, "a.out"))
        except:
            return False
        return True
    finally:
        if oldstderr is not None:
            os.dup2(oldstderr, sys.stderr.fileno())
        if devnull is not None:
            devnull.close()
        shutil.rmtree(tmpdir)
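A hypothetical call site, to show the shape of it (the clock_gettime/librt probe mirrors the example further down; the variable names are mine):

from distutils.ccompiler import new_compiler

cc = new_compiler()
libraries = []
if hasfunction(cc, 'clock_gettime'):
    # link against librt only when the function is really available
    libraries.append('rt')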
Here's a context manager that I recently wrote and found useful, because I was having the same problem with distutils.ccompiler.CCompiler.has_function while working on pymssql. I was going to use your approach (nice, thanks for sharing!), but then I thought it could be done with less code, and would be more general and flexible, if I used a context manager. Here's what I came up with:
import contextlib
import os

@contextlib.contextmanager
def stdchannel_redirected(stdchannel, dest_filename):
    """
    A context manager to temporarily redirect stdout or stderr

    e.g.:

    with stdchannel_redirected(sys.stderr, os.devnull):
        if compiler.has_function('clock_gettime', libraries=['rt']):
            libraries.append('rt')
    """
    # Initialize both names so the finally block can test them safely
    oldstdchannel = dest_file = None
    try:
        oldstdchannel = os.dup(stdchannel.fileno())
        dest_file = open(dest_filename, 'w')
        os.dup2(dest_file.fileno(), stdchannel.fileno())
        yield
    finally:
        if oldstdchannel is not None:
            os.dup2(oldstdchannel, stdchannel.fileno())
        if dest_file is not None:
            dest_file.close()
The context for why I created this is at this blog post. Pretty much the same as yours, I think.
It uses code that I borrowed from you to do the redirection (e.g. os.dup2, etc.), but I wrapped it in a context manager so it's more general and reusable.
I use it like this in a setup.py:
with stdchannel_redirected(sys.stderr, os.devnull):
    if compiler.has_function('clock_gettime', libraries=['rt']):
        libraries.append('rt')
@Adam: I just want to point out that there is a /dev/null equivalent on Windows. It's 'NUL', but good practice is to get it from os.devnull.
I'm pretty new to programming and Python, so disregard this if it's a stupid suggestion, but can't you just reroute the error messages to a text file instead of the screen/interactive window/whatever?
I'm pretty sure I read somewhere you can do something like:
error = open('yourerrorlog.txt', 'w')
sys.stderr = error
Again, sorry if I'm repeating something you already know, but if the problem is that you WANT the errors when it's called by another function (automated) and no errors when it's run manually, can't you just add a keyword argument like compile(arg1, arg2, manual=True), and then under your "except:" add:
if manual == False: print errors to the console/interactive window
else: print to the error file
Then, when it's called by the program and not manually, you just call it with compile(arg1, arg2, manual=False) so that it redirects to the file.
Does running in quiet mode help at all? setup.py -q build
Not a direct answer to your question, but related to your use case: there is a config command in distutils that's designed to be subclassed and used to check for C features. It's not documented yet; you have to read the source.
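To illustrate, here is a sketch of what subclassing it might look like; everything except distutils' own config class and its check_header/try_compile methods is hypothetical, so verify the details against the distutils source:

from distutils.command.config import config

class probe(config):
    # Hypothetical subclass: reuse distutils' own compile-and-discard
    # machinery instead of hand-rolling temp dirs and stream redirects
    def run(self):
        if self.check_header('windows.h'):
            print('windows.h found; NOMINMAX could be defined')
        if self.try_compile('int main(void) { return 0; }'):
            print('compiler works')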