Python multiprocessing doesn't play nicely with uuid.uuid4() - python

I'm trying to generate a uuid for a filename, and I'm also using the multiprocessing module. Unpleasantly, all of my uuids end up exactly the same. Here is a small example:
import multiprocessing
import uuid

def get_uuid( a ):
    ## Doesn't help to cycle through a bunch.
    #for i in xrange(10): uuid.uuid4()
    ## Doesn't help to reload the module.
    #reload( uuid )
    ## Doesn't help to load it at the last minute.
    ## (I simultaneously comment out the module-level import).
    #import uuid
    ## uuid1() does work, but it differs only in the first 8 characters
    ## and includes identifying information about the computer.
    #return uuid.uuid1()
    return uuid.uuid4()

def main():
    pool = multiprocessing.Pool( 20 )
    uuids = pool.map( get_uuid, range( 20 ) )
    for id in uuids: print id

if __name__ == '__main__': main()
I peeked into uuid.py's code, and it seems to use some OS-level routines for randomness, depending on the platform, so I'm stumped as to a Python-level solution (something like reloading the uuid module or choosing a new random seed). I could use uuid.uuid1(), but only 8 digits differ and I think they're derived exclusively from the time, which seems dangerous especially given that I'm multiprocessing (so the code could be executing at exactly the same time). Is there some Wisdom out there about this issue?

This is the correct way to generate your own uuid4, if you need to do that:
import os, uuid
return uuid.UUID(bytes=os.urandom(16), version=4)
Python should be doing this automatically--this code is right out of uuid.uuid4, when the native _uuid_generate_random doesn't exist. There must be something wrong with your platform's _uuid_generate_random.
If you have to do this, don't just work around it yourself and let everyone else on your platform suffer; report the bug.
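For completeness, here is a minimal sketch of that workaround dropped into the question's own pool example (same function names as in the question, only the uuid construction changes):
import multiprocessing
import os
import uuid

def get_uuid(a):
    # Build the UUID from os.urandom directly, sidestepping the platform's
    # broken _uuid_generate_random.
    return uuid.UUID(bytes=os.urandom(16), version=4)

def main():
    pool = multiprocessing.Pool(20)
    uuids = pool.map(get_uuid, range(20))
    for u in uuids:
        print(u)

if __name__ == '__main__':
    main()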

I don't see a way to make this work either, but you could just generate all the uuids in the main process and pass them to the workers.
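A rough sketch of that approach (the worker function here is just a stand-in):
import multiprocessing
import uuid

def write_tmp_file(job):
    job_uuid, payload = job
    # The uuid was created in the parent process, so the duplicated random
    # state in the workers never comes into play.
    return '%s.tmp' % job_uuid

def main():
    pool = multiprocessing.Pool(20)
    jobs = [(uuid.uuid4(), n) for n in range(20)]
    for name in pool.map(write_tmp_file, jobs):
        print(name)

if __name__ == '__main__':
    main()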

This works fine for me. Does your Python installation have os.urandom? If not, random number seeding will be very poor and would lead to this problem (assuming there's also no native UUID module, uuid._uuid_generate_random).
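A quick way to check whether it is available (a sketch; on platforms with no randomness source, os.urandom raises NotImplementedError):
import os

try:
    os.urandom(16)
    print('os.urandom is available')
except (AttributeError, NotImplementedError):
    # Without it, uuid4() has to fall back to a much weaker random source.
    print('os.urandom is NOT available on this platform')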

Currently, I am working on a script which fetches files either from a zip archive or from disk. After fetching, the payload gets pushed to an external tool via a web API.
For performance reasons I used the multiprocessing.Pool.map method, and uuid looked quite handy for the tmp file names. But I ran into the same issue you ask about here.
First, please check out the official uuid docs. There is an attribute called is_safe which tells you whether the uuid was generated in a multiprocessing-safe way. In my case it was not.
After some research, I finally changed my strategy and moved from uuid to the process pid and name.
Because I just need the uuid for tmp file naming, pid and name also work fine. We can access the current worker Process instance via multiprocessing.current_process(). If you really need a uuid, you could potentially integrate the worker pid somehow (a rough sketch of the naming approach follows below).
In addition, uuid uses system entropy for the generation (see the uuid source). Because for me it does not matter how the file is named, this solution also avoids draining entropy.
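A small sketch of that naming scheme (the function and file-name pattern are illustrative):
import multiprocessing
import os
import tempfile

def process_payload(task_id):
    worker = multiprocessing.current_process()
    # worker.name and worker.pid together are unique per pool worker, which
    # is all the uniqueness a temporary file name needs here.
    tmp_name = '%s_%s_%s.tmp' % (worker.name, worker.pid, task_id)
    return os.path.join(tempfile.gettempdir(), tmp_name)

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    print(pool.map(process_payload, range(8)))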

Related

How to unit test program interacting with block devices

I have a program that interacts with and changes block devices (/dev/sda and such) on linux. I'm using various external commands (mostly commands from the fdisk and GNU fdisk packages) to control the devices. I have made a class that serves as the interface for most of the basic actions with block devices (for information like: What size is it? Where is it mounted? etc.)
Here is one such method querying the size of a partition:
import subprocess

def get_drive_size(device):
    """Returns the maximum size of the drive, in sectors.
    :device the device identifier (/dev/sda and such)"""
    query_proc = subprocess.Popen(["blockdev", "--getsz", device], stdout=subprocess.PIPE)
    # blockdev returns the number of 512B blocks in a drive
    output, error = query_proc.communicate()
    exit_code = query_proc.returncode
    if exit_code != 0:
        raise Exception("Non-zero exit code", str(error, "utf-8"))  # I have custom exceptions, this is slight pseudo-code
    return int(output)  # should always be valid
So this method accepts a block device path, and returns an integer. The tests will run as root, since this entire program will end up having to run as root anyway.
Should I try and test code such as these methods? If so, how? I could try and create and mount image files for each test, but this seems like a lot of overhead, and is probably error-prone itself. It expects block devices, so I cannot operate directly on image files in the file system.
I could try mocking, as some answers suggest, but this feels inadequate. It seems like I start to test the implementation of the method, if I mock the Popen object, rather than the output. Is this a correct assessment of proper unit-testing methodology in this case?
I am using python3 for this project, and I have not yet chosen a unit-testing framework. In the absence of other reasons, I will probably just use the default unittest framework included in Python.
You should look into the mock module (I think it's part of the unittest module now in Python 3).
It enables you to run tests without the need to depend on any external resources while giving you control over how the mocks interact with your code.
I would start from the docs in Voidspace
Here's an example:
import unittest2 as unittest
import mock

# get_drive_size would be imported from the module under test, e.g.:
# from path.to.original.file import get_drive_size

class GetDriveSizeTestSuite(unittest.TestCase):
    @mock.patch('path/to/original/file.subprocess.Popen')
    def test_a_scenario_with_mock_subprocess(self, mock_popen):
        mock_popen.return_value.communicate.return_value = ('1024', '')
        mock_popen.return_value.returncode = 0
        self.assertEqual(1024, get_drive_size('some device'))

How to read LV2 ttl file in Python?

I have an LV2 plugin and I want to use Python to extract its metadata - plugin name, description, list of control and audio ports and specification of each port.
With LADSPA the instructions were pretty clear, although a bit difficult to implement in Python: I just needed to call the ladspa_descriptor() function. Now with LV2 there's a .ttl file, simpler to access but more complicated to parse.
Is there any python library that will make this job simple?
The LV2 documentation generation tools use RDFLib. It is probably the most popular RDF interface for Python, though it does much more than just parse Turtle. It is a good choice if performance is not an issue, but it is unfortunately really slow.
If you need to actually instantiate and use plugins, you probably want to use an existing LV2 implementation. As Steve mentioned, Lilv is for this. It is not limited to any static default location, but will look in all the locations in LV2_PATH. You can set this environment variable to whatever you want before calling Lilv and it will only look in those locations. Alternatively, if you want to specifically load just one bundle at a time, there is a function for that: lilv_world_load_bundle().
There are SWIG-based Python bindings included with Lilv, but they stop short of actually allowing you to process data. However there is a project to wrap Lilv that allows processing of audio using scipy arrays: http://pyslv2.sourceforge.net/ (despite the name they are indeed Lilv bindings and not bindings for its predecessor SLV2)
That said, if you only need to get static information from the Turtle files, involving C libraries is probably more trouble than it is worth. One of the big advantages of using standard data files is ease of use with existing tools. To get the number of ports on a plugin, you simply need to count the number of triples that match the pattern (plugin, lv2:port, *). Here is an example Python script that prints the number of ports of a plugin, given the file to read and the plugin URI as command line arguments:
#!/usr/bin/env python
import rdflib
import sys

lv2 = rdflib.Namespace('http://lv2plug.in/ns/lv2core#')

path = sys.argv[1]
plugin = rdflib.URIRef(sys.argv[2])

model = rdflib.ConjunctiveGraph()
model.parse(path, format='n3')

num_ports = 0
for i in model.triples((plugin, lv2.port, None)):
    num_ports += 1

print('%s has %u ports' % (plugin, num_ports))
This is how to get the number of ports each plugin supports:
import lilv

w = lilv.World()
w.load_all()
for p in w.get_all_plugins():
    print p.get_name().as_string(), p.get_num_ports()
At least this is all I got while trying to figure this out.

how to specify the name of my daemon process using pydaemon

I'm using pydaemon (http://www.python.org/dev/peps/pep-3143/) to make a friendly daemon. How do I give it a name? By default it's called 'python', but I want something more meaningful.
Changing the process name cannot be done from plain Python, and pydaemon is 100% Python. You need a C-level library like py-setproctitle to do that. Then, simply add the following to your main method:
try:
    import setproctitle
    setproctitle.setproctitle('my-awesome-program')
except:
    pass  # Ignore errors, since this is only cosmetic

Accessing samba shares with gio in python

I am trying to make a simple command line client for accessing shares via the Python bindings of gio (yes, the main requirement is to use gio).
I can see that, compared with its predecessor gnome-vfs, it provides some means to do authentication (by subclassing MountOperation), and even some methods which are quite specific to samba shares, like set_domain().
But I'm stuck with this code:
import gio
fh = gio.File("smb://server_name/")
If that server needs authentication, I suppose a call to fh.mount_enclosing_volume() is needed, as this method takes a MountOperation as a parameter. The problem is that calling this method does nothing, and the logical fh.enumerate_children() call (to list the available shares) that comes next fails.
Could anybody provide a working example of how this would be done with gio?
The following appears to be the minimum code needed to mount a volume:
def mount(f):
    op = gio.MountOperation()
    op.connect('ask-password', ask_password_cb)
    f.mount_enclosing_volume(op, mount_done_cb)

def ask_password_cb(op, message, default_user, default_domain, flags):
    op.set_username(USERNAME)
    op.set_domain(DOMAIN)
    op.set_password(PASSWORD)
    op.reply(gio.MOUNT_OPERATION_HANDLED)

def mount_done_cb(obj, res):
    obj.mount_enclosing_volume_finish(res)
(Derived from gvfs-mount.)
In addition, you may need a glib.MainLoop running because GIO mount functions are asynchronous. See the gvfs-mount source code for details.
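A sketch of how that main loop might be wired in, re-using the ask_password_cb defined in the snippet above (untested, and assuming the same static gio/glib bindings):
import gio
import glib

loop = glib.MainLoop()

def mount_done_cb(obj, res):
    obj.mount_enclosing_volume_finish(res)
    loop.quit()  # stop the loop once the asynchronous mount has completed

fh = gio.File("smb://server_name/")
op = gio.MountOperation()
op.connect('ask-password', ask_password_cb)  # ask_password_cb as defined above
fh.mount_enclosing_volume(op, mount_done_cb)
loop.run()  # blocks until mount_done_cb quits the loop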

How can I strip Python logging calls without commenting them out?

Today I was thinking about a Python project I wrote about a year back where I used logging pretty extensively. I remember having to comment out a lot of logging calls in inner-loop-like scenarios (the 90% code) because of the overhead (hotshot indicated it was one of my biggest bottlenecks).
I wonder now if there's some canonical way to programmatically strip out logging calls in Python applications without commenting and uncommenting all the time. I'd think you could use inspection/recompilation or bytecode manipulation to do something like this and target only the code objects that are causing bottlenecks. This way, you could add a manipulator as a post-compilation step and use a centralized configuration file, like so:
[Leave ERROR and above]
my_module.SomeClass.method_with_lots_of_warn_calls
[Leave WARN and above]
my_module.SomeOtherClass.method_with_lots_of_info_calls
[Leave INFO and above]
my_module.SomeWeirdClass.method_with_lots_of_debug_calls
Of course, you'd want to use it sparingly and probably with per-function granularity -- only for code objects that have shown logging to be a bottleneck. Anybody know of anything like this?
Note: There are a few things that make this more difficult to do in a performant manner because of dynamic typing and late binding. For example, any calls to a method named debug may have to be wrapped with an if not isinstance(log, Logger). In any case, I'm assuming all of the minor details can be overcome, either by a gentleman's agreement or some run-time checking. :-)
What about using logging.disable?
I've also found I had to use logging.isEnabledFor if the logging message is expensive to create.
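A short sketch of how those two fit together (the module and function names here are made up):
import logging

logger = logging.getLogger(__name__)

# Globally discard everything at DEBUG level and below, regardless of
# individual logger levels.
logging.disable(logging.DEBUG)

def process(items):
    # Guard the expensive message construction so the sort/join is skipped
    # entirely whenever DEBUG is not going to be emitted anyway.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug('processing: %s', ', '.join(sorted(items)))
    return len(items)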
Use pypreprocessor, which can also be found on PyPI (the Python Package Index) and fetched using pip.
Here's a basic usage example:
from pypreprocessor import pypreprocessor
pypreprocessor.parse()
#define nologging
#ifdef nologging
...logging code you'd usually comment out manually...
#endif
Essentially, the preprocessor comments out code the way you were doing it manually before. It just does it on the fly conditionally depending on what you define.
You can also remove all of the preprocessor directives and commented-out code from the postprocessed code by adding pypreprocessor.removeMeta = True between the import and parse() statements.
The bytecode output (.pyc) file will contain the optimized output.
Side note: pypreprocessor is compatible with both Python 2.x and Python 3.
Disclaimer: I'm the author of pypreprocessor.
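For completeness, the removeMeta placement described above would look roughly like this (assuming pypreprocessor is installed):
from pypreprocessor import pypreprocessor

# Strip the directives and the commented-out blocks from the postprocessed
# output, as described above.
pypreprocessor.removeMeta = True
pypreprocessor.parse()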
I've also seen assert used in this fashion.
assert logging.warn('disable me with the -O option') is None
(I'm guessing that warn always returns None; if not, you'll get an AssertionError.)
But really that's just a funny way of doing this:
if __debug__: logging.warn('disable me with the -O option')
When you run a script containing that line with the -O option, the line will be removed from the optimized .pyo code. If, instead, you had your own variable, as in the following, you will have a conditional that is always executed (no matter what value the variable has), although a conditional should execute more quickly than a function call:
my_debug = True
...
if my_debug: logging.warn('disable me by setting my_debug = False')
So if my understanding of __debug__ is correct, it seems like a nice way to get rid of unnecessary logging calls. The flip side is that it also disables all of your asserts, so it is a problem if you need the asserts.
As an imperfect shortcut, how about mocking out logging in specific modules using something like MiniMock?
For example, if my_module.py was:
import logging

class C(object):
    def __init__(self, *args, **kw):
        logging.info("Instantiating")
You would replace your use of my_module with:
from minimock import Mock
import my_module
my_module.logging = Mock('logging')
c = my_module.C()
You'd only have to do this once, before the initial import of the module.
Getting the level specific behaviour would be simple enough by mocking specific methods, or having logging.getLogger return a mock object with some methods impotent and others delegating to the real logging module.
In practice, you'd probably want to replace MiniMock with something simpler and faster; at the very least something which doesn't print usage to stdout! Of course, this doesn't handle the problem of module A importing logging from module B (and hence A also importing the log granularity of B)...
This will never be as fast as not running the log statements at all, but should be much faster than going all the way into the depths of the logging module only to discover this record shouldn't be logged after all.
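A hand-rolled stand-in along those lines might look like this (a sketch, not MiniMock itself, reusing the my_module example above):
import logging as _real_logging

class FakeLoggingModule(object):
    """Stand-in for the logging module: debug/info are silenced, the rest delegate."""
    def debug(self, *args, **kwargs):
        pass  # impotent, as described above
    def info(self, *args, **kwargs):
        pass  # impotent, as described above
    def __getattr__(self, name):
        # Anything else (warning, error, exception, getLogger, ...) is
        # forwarded to the real logging module.
        return getattr(_real_logging, name)

import my_module
my_module.logging = FakeLoggingModule()
c = my_module.C()  # the logging.info call in __init__ is now a no-op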
You could try something like this:
# Create something that accepts anything
class Fake(object):
    def __getattr__(self, key):
        return self
    def __call__(self, *args, **kwargs):
        return True

# Replace the logging module
import sys
sys.modules["logging"] = Fake()
It essentially replaces (or initially fills in) the space for the logging module with an instance of Fake which simply takes in anything. You must run the above code (just once!) before the logging module is attempted to be used anywhere. Here is a test:
import logging

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)-8s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    filename='/temp/myapp.log',
                    filemode='w')
logging.debug('A debug message')
logging.info('Some information')
logging.warning('A shot across the bows')
With the above, nothing at all was logged, as was to be expected.
I'd use some fancy logging decorator, or a bunch of them:
import time

LOGLEVEL = 2.5  # declare this constant in each module (or import it from one place)

def doLogging(logTreshold):
    def logFunction(aFunc):
        def innerFunc(*args, **kwargs):
            if LOGLEVEL >= logTreshold:
                print ">>Called %s at %s" % (aFunc.__name__, time.strftime("%H:%M:%S"))
                print ">>Parameters: ", args, kwargs if kwargs else ""
            try:
                return aFunc(*args, **kwargs)
            finally:
                print ">>%s took %s" % (aFunc.__name__, time.strftime("%H:%M:%S"))
        return innerFunc
    return logFunction
All you need is to declare a LOGLEVEL constant in each module (or declare it globally and import it in all modules) and then you can use it like this:
@doLogging(2.5)
def myPreciousFunction(one, two, three=4):
    print "I'm doing some fancy computations :-)"
    return
And if LOGLEVEL is no less than 2.5 you'll get output like this:
>>Called myPreciousFunction at 18:49:13
>>Parameters: (1, 2)
I'm doing some fancy computations :-)
>>myPreciousFunction took 18:49:13
As you can see, some work is needed for better handling of kwargs, so the default values will be printed if they are present, but that's another question.
You should probably use some logger module instead of raw print statements, but I wanted to focus on the decorator idea and avoid making code too long.
Anyway, with such a decorator you get function-level logging, arbitrarily many log levels, ease of application to new functions, and to disable logging you only need to set LOGLEVEL. And you can define different output streams/files for each function if you wish. You can write doLogging as:
def doLogging(logThreshold, outStream=sys.stdout):
    .....
    print >>outStream, ">>Called %s at %s" etc.
And utilize log files defined on a per-function basis.
This is an issue in my project as well--logging ends up on profiler reports pretty consistently.
I've used the _ast module before in a fork of PyFlakes (http://github.com/kevinw/pyflakes) ... and it is definitely possible to do what you suggest in your question--to inspect and inject guards before calls to logging methods (with your acknowledged caveat that you'd have to do some runtime type checking). See http://pyside.blogspot.com/2008/03/ast-compilation-from-python.html for a simple example.
Edit: I just noticed MetaPython on my planetpython.org feed--the example use case is removing log statements at import time.
Maybe the best solution would be for someone to reimplement logging as a C module, but I wouldn't be the first to jump at such an...opportunity :p
:-) We used to call that a preprocessor, and although C's preprocessor had some of those capabilities, the "king of the hill" was the preprocessor for IBM mainframe PL/I. It provided extensive language support in the preprocessor (full assignments, conditionals, looping, etc.) and it was possible to write "programs that wrote programs" using just the PL/I PP.
I wrote many applications with full-blown sophisticated program and data tracing (we didn't have a decent debugger for a back-end process at that time) for use in development and testing which then, when compiled with the appropriate "runtime flag" simply stripped all the tracing code out cleanly without any performance impact.
I think the decorator idea is a good one. You can write a decorator to wrap the functions that need logging. Then, for runtime distribution, the decorator is turned into a "no-op" which eliminates the debugging statements.
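A minimal sketch of that switch (the STRIP_LOGGING flag and the names are illustrative):
import logging

STRIP_LOGGING = True  # flip to False for a development build

def logged(func):
    if STRIP_LOGGING:
        # "Runtime distribution" mode: hand back the original function, so
        # there is no wrapper overhead at all.
        return func
    def wrapper(*args, **kwargs):
        logging.debug('calling %s args=%r kwargs=%r', func.__name__, args, kwargs)
        return func(*args, **kwargs)
    return wrapper

@logged
def compute(x, y):
    return x + y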
Jon R
I am doing a project currently that uses extensive logging for testing logic and execution times for a data analysis API using the Pandas library.
I found this thread with a similar concern, i.e. what is the overhead of the logging.debug statements even if the logging.basicConfig level is set to level=logging.WARNING?
I have resorted to writing the following script to comment out or uncomment the debug logging prior to deployment:
import os
import fileinput

comment = True

# exclude files or directories matching string
fil_dir_exclude = ["__", "_archive", ".pyc"]

if comment:
    ## Variables to comment
    source_str = 'logging.debug'
    replace_str = '#logging.debug'
else:
    ## Variables to uncomment
    source_str = '#logging.debug'
    replace_str = 'logging.debug'

# walk through directories
for root, dirs, files in os.walk('root/directory'):
    # where files exist
    if files:
        # for each file
        for file_single in files:
            # build full file name
            file_name = os.path.join(root, file_single)
            # exclude files with matching string
            if not any(exclude_str in file_name for exclude_str in fil_dir_exclude):
                # replace string in line
                for line in fileinput.input(file_name, inplace=True):
                    print "%s" % (line.replace(source_str, replace_str)),
This is a file recursion that excludes files based on a list of criteria and performs an in place replace based on an answer found here: Search and replace a line in a file in Python
I like the 'if __debug__' solution except that putting it in front of every call is a bit distracting and ugly. I had this same problem and overcame it by writing a script which automatically parses your source files and replaces logging statements with pass statements (and commented-out copies of the logging statements). It can also undo this conversion.
I use it when I deploy new code to a production environment when there are lots of logging statements which I don't need in a production setting and they are affecting performance.
You can find the script here: http://dound.com/2010/02/python-logging-performance/
