I have a python console application that contains 300+ regular expressions. The set of regular expressions is fixed for each release. When users run the app, the entire set of regular expressions will be applied anywhere from once (a very short job) to thousands of times (a long job).
I would like to speed up the shorter jobs by compiling the regular expressions up front, pickle the compiled regular expressions to a file, and then load that file when the application is run.
The python re module is efficient and the regex compilation overhead is quite acceptable for long jobs. For short jobs, however, it is a large proportion of the overall run-time. Some users will want to run many small jobs to fit into their existing workflows. Compiling the regular expressions takes about 80ms. A short job might take 20ms-100ms excluding regular expression compilation. So for short jobs, the overhead can be 100% or more. This is with Python 2.7 under both Windows and Linux.
The regular expressions must be applied with the DOTALL flag, so they need to be compiled prior to use. A large compilation cache clearly doesn't help in this instance. As some have pointed out, the default method of serialising a compiled regular expression doesn't actually do much.
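For reference, a measurement along these lines can be reproduced with something like the following sketch (the pattern list is a made-up stand-in for the real 300+ expressions):

import re
import time

# Hypothetical stand-in for the application's 300+ fixed patterns.
patterns = [r"foo.*bar%d" % i for i in range(300)]

re.purge()  # empty re's internal cache so we time real compilation
start = time.time()
compiled = [re.compile(p, re.DOTALL) for p in patterns]
elapsed_ms = (time.time() - start) * 1000
print "compiled %d patterns in %.1f ms" % (len(compiled), elapsed_ms)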
The re and sre modules compile the patterns into a little custom language with its own opcodes and some auxiliary data structures (e.g., for charsets used in an expression). The pickle function in re.py takes the easy way out. It is:
def _pickle(p):
    return _compile, (p.pattern, p.flags)
copy_reg.pickle(_pattern_type, _pickle, _compile)
I think that a good solution to the problem would be an update to the definition of _pickle in re.py that actually pickled the compiled pattern object. Unfortunately, this goes beyond my python skills. I bet, however, that someone here knows how to do it.
I realise that I am not the first person to ask this question - but perhaps you can be the first person to give an accurate and useful response to it!
Your advice would be greatly appreciated.
OK, this isn't pretty, but it might be what you want. I looked at the sre_compile.py module from Python 2.6, and ripped out a bit of it, chopped it in half, and used the two pieces to pickle and unpickle compiled regexes:
import re, sre_compile, sre_parse, _sre
import cPickle as pickle

# the first half of sre_compile.compile
def raw_compile(p, flags=0):
    # internal: convert pattern list to internal format
    if sre_compile.isstring(p):
        pattern = p
        p = sre_parse.parse(p, flags)
    else:
        pattern = None
    code = sre_compile._code(p, flags)
    return p, code

# the second half of sre_compile.compile
def build_compiled(pattern, p, flags, code):
    # print code
    # XXX: <fl> get rid of this limitation!
    if p.pattern.groups > 100:
        raise AssertionError(
            "sorry, but this version only supports 100 named groups"
            )
    # map in either direction
    groupindex = p.pattern.groupdict
    indexgroup = [None] * p.pattern.groups
    for k, i in groupindex.items():
        indexgroup[i] = k
    return _sre.compile(
        pattern, flags | p.pattern.flags, code,
        p.pattern.groups - 1,
        groupindex, indexgroup
        )

def pickle_regexes(regexes):
    picklable = []
    for r in regexes:
        p, code = raw_compile(r, re.DOTALL)
        picklable.append((r, p, code))
    return pickle.dumps(picklable)

def unpickle_regexes(pkl):
    regexes = []
    for r, p, code in pickle.loads(pkl):
        regexes.append(build_compiled(r, p, re.DOTALL, code))
    return regexes

regexes = [
    r"^$",
    r"a*b+c*d+e*f+",
    ]

pkl = pickle_regexes(regexes)
print pkl
print unpickle_regexes(pkl)
I don't really know if this works, or if it speeds things up. I know it prints a list of regexes when I try it. It might also be very specific to version 2.6; I don't know that either.
As others have mentioned, you can simply pickle compiled regexes. They will pickle and unpickle just fine, and be usable. However, it doesn't look like the pickle actually contains the result of compilation. I suspect you will incur the compilation overhead again when you use the result of the unpickling.
>>> p.dumps(re.compile("a*b+c*"))
"cre\n_compile\np1\n(S'a*b+c*'\np2\nI0\ntRp3\n."
>>> p.dumps(re.compile("a*b+c*x+y*"))
"cre\n_compile\np1\n(S'a*b+c*x+y*'\np2\nI0\ntRp3\n."
In these two tests, you can see the only difference between the two pickles is in the string. Apparently compiled regexes don't pickle the compiled bits, just the string needed to compile it again.
But I'm wondering about your application overall: compiling a regex is a fast operation, so how short are your jobs that the compilation is significant? One possibility is that you are compiling all 300 regexes, and then only using one for a short job. In that case, don't compile them all up front. The re module is very good at using cached copies of compiled regexes, so you generally don't have to compile them yourself, just use the string form. The re module will look up the string in a dictionary of compiled regexes, so grabbing the compiled form yourself only saves you a dictionary lookup. I may be totally off-base, sorry if so.
Just compile as you go - the re module will cache the compiled regexes even if you don't. Bump re._MAXCACHE up to 400 or 500; the short jobs will only compile the regexes they need, and the long jobs benefit from a big fat cache of compiled expressions - everybody's happy!
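A minimal sketch of that idea (_MAXCACHE is an undocumented internal of the re module, so treat this as a best-effort tweak rather than a supported API):

import re

# Enlarge re's internal pattern cache so all 300+ patterns stay cached
# once used (internal attribute; name and default vary between versions).
re._MAXCACHE = 500

# Use the string form directly; re compiles and caches it on first use.
if re.search(r"(?s)start.*end", "start\nmiddle\nend"):
    print "matched"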
Some observations and musings:
You don't need to compile to get the effect of the re.DOTALL flag (or any other flag) -- all you need to do is insert (?s) at the start of the pattern string ... re.DOTALL -> re.S -> the s in (?s). Do a Ctrl-F search for sux (sic) in the re syntax docs. (See the small check after these notes.)
80ms seems a very short time, even when multiplied by "many" (how many??) short jobs.
Does each job require a new Python process to be started? If so, isn't 80ms small compared with process startup and shutdown overhead? Otherwise, please explain why it is not possible, when a user wants to run "many" small jobs, to do the re.compiles once per batch of jobs.
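A small check of the first observation above, showing that the inline (?s) flag has the same effect as passing re.DOTALL:

import re

text = "first line\nsecond line"

# With the flag argument...
assert re.match(r"first.*second", text, re.DOTALL)
# ...and with the inline (?s) prefix instead -- no compile-time flag needed.
assert re.match(r"(?s)first.*second", text)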
In a similar case (where every input needs to be run through ALL of the regexes), I had to split the Python script into a master-slave setup using *nix sockets; the first time the script is called, the master (which does all the time-expensive regex compilations) starts up, and the slave for that and all subsequent invocations exchanges data with the master. The master stays idle for a maximum of N seconds.
In my case, this master/slave setup was found to be faster on all occasions than the straightforward way (many invocations against relatively little data every time; also, it had to be a script because it is called from an external application without any Python bindings). I don't know whether this would apply to your situation.
I had the same problem and, instead of patching Python's re module, I opted to create a long-running regex "service" instead. Basic code is appended below. Please note: it is not designed to handle multiple clients in parallel, i.e. the server only becomes available again once a client has closed the connection.
server
from multiprocessing.connection import Client
from multiprocessing.connection import Listener
import re

class RegexService(object):
    patternsByRegex = None

    def __init__(self):
        self.patternsByRegex = {}

    def processMessage(self, message):
        regex = message.get('regex')
        result = {"error": None}
        if regex == None:
            result["error"] = "no regex in message - something is wrong with your client"
            return result
        text = message.get('text')
        pattern = self.patternsByRegex.get(regex)
        if pattern == None:
            print "compiling previously unseen regex: %s" % (regex)
            pattern = re.compile(regex, re.IGNORECASE)
            self.patternsByRegex[regex] = pattern
        if text == None:
            result["error"] = "no match"
            return result
        match = pattern.match(text)
        result["matchgroups"] = None
        if match == None:
            return result
        result["matchgroups"] = match.groups()
        return result

workAddress = ('localhost', 6000)
resultAddress = ('localhost', 6001)
listener = Listener(workAddress, authkey='secret password')
service = RegexService()
patterns = {}
while True:
    connection = listener.accept()
    resultClient = Client(resultAddress, authkey='secret password')
    while True:
        try:
            message = connection.recv()
            resultClient.send(service.processMessage(message))
        except EOFError:
            resultClient.close()
            connection.close()
            break
listener.close()
testclient
from multiprocessing.connection import Client
from multiprocessing.connection import Listener

workAddress = ('localhost', 6000)
resultAddress = ('localhost', 6001)
regexClient = Client(workAddress, authkey='secret password')
resultListener = Listener(resultAddress, authkey='secret password')
resultConnection = None

def getResult():
    global resultConnection
    if resultConnection == None:
        resultConnection = resultListener.accept()
    return resultConnection.recv()

regexClient.send({
    "regex": r'.*'
})
print str(getResult())

regexClient.send({
    "regex": r'.*',
    "text": "blub"
})
print str(getResult())

regexClient.send({
    "regex": r'(.*)',
    "text": "blub"
})
print str(getResult())

resultConnection.close()
regexClient.close()
output of test client run 2 times
$ python ./regexTest.py
{'error': 'no match'}
{'matchgroups': (), 'error': None}
{'matchgroups': ('blub',), 'error': None}
$ python ./regexTest.py
{'error': 'no match'}
{'matchgroups': (), 'error': None}
{'matchgroups': ('blub',), 'error': None}
output of service process during both test runs
$ python ./regexService.py
compiling previously unseen regex: .*
compiling previously unseen regex: (.*)
As long as you create them on program start, the pyc file will cache them. You don't need to resort to pickling.
Related
I want to write a tool in Python to prepare a simulation study by creating for each simulation run a folder and a configuration file with some run-specific parameters.
study/
    study.conf
    run1
        run.conf
    run2
        run.conf
The tool should read the overall study configuration from a file including (1) static parameters (key-value pairs), (2) lists for iteration parameters, and (3) some small code snippets to calculate further parameters from the previous ones. The latter are run specific depending on the permutation of the iteration parameters used.
Before writing the run.conf files from a template, I need to run some code like this to determine the specific key-value pairs from the code snippets for that run
code = compile(code_str, 'foo.py', 'exec')
rv = eval(code, context, {})
However, as confirmed by the Python documentation, this just returns None.
The code string and context dictionary in the example are filled elsewhere. For this discussion, this snippet should do it:
code_str = """import math
math.sqrt(width**2 + height**2)
"""
context = {
    'width': 30,
    'height': 10
}
I have done this before in Perl and Java+JavaScript. There, you just give the code snippet to some evaluation function or script engine and get in return a value (object) from the last executed statement -- not a big issue.
Now, in Python I struggle with the fact that eval() is too narrow, allowing only a single expression, and exec() doesn't return values at all. I need to import modules and sometimes do some slightly more complex calculations, e.g., 5 lines of code.
Isn't there a better solution that I don't see at the moment?
During my research, I found some very good discussions about Python's eval() and exec(), and also some tricky solutions that circumvent the issue by going via stdout and parsing the return value from there. The latter would do it, but is not very nice and is already 5 years old.
The exec function will modify the globals parameter (dict) passed to it, so you can use the code below:
code_str = """import math
Result1 = math.sqrt(width**2 + height**2)
"""
context = {
    'width': 30,
    'height': 10
}
exec(code_str, context)
print(context['Result1'])  # 31.6
Every variable code_str creates will end up as a key:value pair in the context dictionary. So the dict is the "object" like you mentioned for JavaScript.
Edit1:
If you only need the result of the last line in code_str and want to avoid writing something like Result1=..., try the code below:
code_str = """import math
math.sqrt(width**2 + height**2)
"""
context = {'width': 30, 'height': 10}

lines = [l for l in code_str.split('\n') if l.strip()]
lines[-1] = '__myresult__=' + lines[-1]
exec('\n'.join(lines), context)
print(context['__myresult__'])
This approach is not as robust as the former one, but should work for most cases. If you need to manipulate the code in a more sophisticated way, take a look at the ast module (Abstract Syntax Trees).
Since this whole exec() / eval() thing in Python is a bit weird ... I have chosen to re-design the whole thing based on a proposal in the comments to my question (thanks #jonrsharpe).
Now, the whole study specification is a .py module that the user can edit. From there, the configuration setup is directly written to a central object of the whole package. On tool runs, the configuration module is imported using the code below
import imp
import os
import sys

# import the configuration as a module
(path, name) = os.path.split(filename)
(name, _) = os.path.splitext(name)
(file, filename, data) = imp.find_module(name, [path])
try:
    module = imp.load_module(name, file, filename, data)
except ImportError as e:
    print(e)
    sys.exit(1)
finally:
    file.close()
I came across similar needs, and finally figured out an approach by playing with ast:
import ast

code = """
def tf(n):
    return n*n
r = tf(3)
{"vvv": tf(5)}
"""

ast_ = ast.parse(code, '<code>', 'exec')

final_expr = None
for field_ in ast.iter_fields(ast_):
    if 'body' != field_[0]: continue
    if len(field_[1]) > 0 and isinstance(field_[1][-1], ast.Expr):
        final_expr = ast.Expression()
        final_expr.body = field_[1].pop().value

ld = {}
rv = None
exec(compile(ast_, '<code>', 'exec'), None, ld)
if final_expr:
    rv = eval(compile(final_expr, '<code>', 'eval'), None, ld)

print('got locals: {}'.format(ld))
print('got return: {}'.format(rv))
It will eval the last clause instead of exec-ing it if it is an expression; otherwise everything is exec'd and the return value is None.
Output:
got locals: {'tf': <function tf at 0x10103a268>, 'r': 9}
got return: {'vvv': 25}
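The same idea can be packaged as a small helper; this is just a sketch of the approach above, and the function name and defaults are my own choices:

import ast

def exec_with_return(code, globals_=None, locals_=None):
    # Exec a code string; if its last statement is an expression,
    # return that expression's value, otherwise return None.
    globals_ = {} if globals_ is None else globals_
    locals_ = {} if locals_ is None else locals_
    tree = ast.parse(code, '<code>', 'exec')
    last = None
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last = ast.Expression(tree.body.pop().value)
    exec(compile(tree, '<code>', 'exec'), globals_, locals_)
    if last is not None:
        return eval(compile(last, '<code>', 'eval'), globals_, locals_)
    return None

print(exec_with_return("import math\nmath.sqrt(width**2 + height**2)",
                       locals_={'width': 30, 'height': 10}))  # 31.62...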
I am writing a program with Dragonfly using WSR, where it has to analyse a word; any voice input matching that word should output 'yes it matches'.
If I say 'czechoslovakia' then it must print true even for all similar-sounding matches of this word, like 'circle slovakia', 'cat on slavia', 'seko vakia'...
What specific methods should I use for this?
My program
from dragonfly.all import *
import pythoncom
import time

# Voice command rule combining spoken form and recognition processing.
class ExampleRule(CompoundRule):
    spec = "czechoslovakia|circle slovalia|sceko bakia|cat on ania"  # Spoken form of command.

    def _process_recognition(self, node, extras):  # Callback when command is spoken.
        print "Voice command spoken."

# Create a grammar which contains and loads the command rule.
grammar = Grammar("example grammar")  # Create a grammar to contain the command rule.
grammar.add_rule(ExampleRule())       # Add the command rule to the grammar.
grammar.load()                        # Load the grammar.

while True:
    pythoncom.PumpWaitingMessages()
    time.sleep(.1)
There is nothing built into Dragonfly to allow you to do this, but you have some other options.
If you're looking to dynamically generate the spec, you might want to look at Fuzzy. You could give it a word and use it to generate other similar-sounding words from that word. Then you could create the spec from them.
Here is the WSR engine class in Dragonfly. I don't know much about SAPI5, but you might be able to ask it for alternatives. If you can, you might be able to extend the Dragonfly GrammarWrapper to expose the alternatives, and then use a catchall grammar to save all utterances and then filter out what you want (possibly using Fuzzy).
If you were using Natlink, I would recommend looking at the results object. As you can see here, the results object has access to all of Dragon's different hypotheses for what you said in a given utterance. Just as with my second suggestion, you could catch everything and then filter what you wanted:
from natlinkutils import GrammarBase

class CatchAll(GrammarBase):
    # this spec will catch everything
    gramSpec = """
        <start> exported = {emptyList};
    """

    def initialize(self):
        self.load(self.gramSpec, allResults=1)
        self.activateAll()

    def gotResultsObject(self, recogType, resObj):
        for x in range(0, 100):
            try:
                possible_interpretation = resObj.getWords(x)
                # do whatever sort of filtering you want here
            except Exception:
                break

c = CatchAll()
c.initialize()

def unload():
    global c
    if c:
        c.unload()
    c = None
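As a rough illustration of the "filter what you want" step, the standard-library difflib can stand in for a phonetic matcher here (this is a substitute for Fuzzy, not something provided by Dragonfly or Natlink):

import difflib

TARGET = "czechoslovakia"

def sounds_close(utterance, cutoff=0.6):
    # Crude text-similarity test on the recognised words; a real phonetic
    # matcher (e.g. soundex/nysiis from Fuzzy) would compare sound codes.
    joined = "".join(utterance.lower().split())
    return difflib.SequenceMatcher(None, joined, TARGET).ratio() >= cutoff

for candidate in ["circle slovakia", "cat on slavia", "seko vakia", "open notepad"]:
    if sounds_close(candidate):
        print "yes it matches:", candidate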
On Linux systems, root privileges can be granted more selectively than by setting the setuid bit, using file capabilities. See capabilities(7) for details. These are attributes of files and can be read using the getcap program. How can these attributes be retrieved in Python?
Even though it is possible to run the getcap program using e.g. subprocess to answer such a question, doing so is not desirable when retrieving very many capabilities.
It should be possible to devise a solution using ctypes. Are there alternatives to this approach or even libraries facilitating this task?
Python 3.3 comes with os.getxattr. If not, yeah... one way would be using ctypes, at least to get the raw stuff, or maybe use pyxattr
For pyxattr:
>>> import xattr
>>> xattr.listxattr("/bin/ping")
(u'security.capability',)
>>> xattr.getxattr("/bin/ping", "security.capability")
'\x00\x00\x00\x02\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
For Python 3.3's version it's essentially the same, just importing os instead of xattr. ctypes is a bit more involved, though.
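A minimal sketch of the os-module variant (Python 3.3+, Linux only; it returns the same raw bytes as above):

import os

print(os.listxattr("/bin/ping"))
print(os.getxattr("/bin/ping", "security.capability"))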
Now, we're getting the raw result, meaning that those two are most useful only retrieving textual attributes. But... we can use the same approach of getcap, through libcap itself:
import ctypes
libcap = ctypes.cdll.LoadLibrary("libcap.so")
cap_t = libcap.cap_get_file('/bin/ping')
libcap.cap_to_text.restype = ctypes.c_char_p
libcap.cap_to_text(cap_t, None)
which gives me:
'= cap_net_raw+p'
probably more useful for you.
PS: note that cap_to_text returns a malloced string. It's your job to deallocate it using cap_free
Hint about the "binary gibberish":
>>> import struct
>>> caps = '\x00\x00\x00\x02\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> struct.unpack("<IIIII", caps)
(33554432, 8192, 0, 0, 0)
In that 8192, the only active bit is the 13th. If you go to linux/capability.h, you'll see that CAP_NET_RAW is defined at 13.
Now, if you want to write a module with all those constants, you can decode the info. But I'd say it's much more laborious than just using ctypes + libcap.
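For what it's worth, a rough sketch of that decoding, assuming the revision-2 layout of struct vfs_cap_data (a magic word followed by permitted/inheritable pairs) and the CAP_NET_RAW constant from linux/capability.h:

import struct

CAP_NET_RAW = 13  # from linux/capability.h

caps = '\x00\x00\x00\x02\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
magic, permitted_lo, inheritable_lo, permitted_hi, inheritable_hi = \
    struct.unpack("<IIIII", caps)
if permitted_lo & (1 << CAP_NET_RAW):
    print "CAP_NET_RAW is in the permitted set"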
I tried the code from Ricardo Cárdenes's answer, but it did not work properly for me, because some details of the ctypes invocation were incorrect. The issue caused a truncated path string to be passed to getxattr(...) inside libcap, which thus returned the wrong capabilities list for the wrong item (the / directory, or another leading path character, and not the actual path).
It is very important to remember and account for the difference between str and bytes in Python 3.X. This code works properly on Python 3.5/3.6:
#!/usr/bin/env python3
import ctypes
import os
import sys

# load shared library
libcap = ctypes.cdll.LoadLibrary('libcap.so')

class libcap_auto_c_char_p(ctypes.c_char_p):
    def __del__(self):
        libcap.cap_free(self)

# cap_t cap_get_file(const char *path_p)
libcap.cap_get_file.argtypes = [ctypes.c_char_p]
libcap.cap_get_file.restype = ctypes.c_void_p

# char* cap_to_text(cap_t caps, ssize_t *length_p)
libcap.cap_to_text.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
libcap.cap_to_text.restype = libcap_auto_c_char_p

def cap_get_file(path):
    cap_t = libcap.cap_get_file(path.encode('utf-8'))
    if cap_t is None:
        return ''
    else:
        return libcap.cap_to_text(cap_t, None).value.decode('utf-8')

print(cap_get_file('/usr/bin/traceroute6.iputils'))
print(cap_get_file('/usr/bin/systemd-detect-virt'))
print(cap_get_file('/usr/bin/mtr'))
print(cap_get_file('/usr/bin/tar'))
print(cap_get_file('/usr/bin/bogus'))
The output will look like this (anything nonexistent, or with no capabilities set, just returns ''):
= cap_net_raw+ep
= cap_dac_override,cap_sys_ptrace+ep
= cap_net_raw+ep
I'm seeking advice about methods of implementing object persistence in Python. To be more precise, I wish to be able to link a Python object to a file in such a way that any Python process that opens a representation of that file shares the same information, any process can change its object and the changes will propagate to the other processes, and even if all processes "storing" the object are closed, the file will remain and can be re-opened by another process.
I found three main candidates for this in my distribution of Python - anydbm, pickle, and shelve (dbm appeared to be perfect, but it is Unix-only, and I am on Windows). However, they all have flaws:
anydbm can only handle a dictionary of string values (I'm seeking to store a list of dictionaries, all of which have string keys and string values, though ideally I would seek a module with no type restrictions)
shelve requires that a file be re-opened before changes propagate - for instance, if two processes A and B load the same file (containing a shelved empty list), and A adds an item to the list and calls sync(), B will still see the list as being empty until it reloads the file.
pickle (the module I am currently using for my test implementation) has the same "reload requirement" as shelve, and also does not overwrite previous data - if process A dumps fifteen empty strings onto a file, and then the string 'hello', process B will have to load the file sixteen times in order to get the 'hello' string. I am currently dealing with this problem by preceding any write operation with repeated reads until end of file ("wiping the slate clean before writing on it"), and by making every read operation repeated until end of file, but I feel there must be a better way.
My ideal module would behave as follows (with "A>>>" representing code executed by process A, and "B>>>" code executed by process B):
A>>> import imaginary_perfect_module as mod
B>>> import imaginary_perfect_module as mod
A>>> d = mod.load('a_file')
B>>> d = mod.load('a_file')
A>>> d
{}
B>>> d
{}
A>>> d[1] = 'this string is one'
A>>> d['ones'] = 1 #anydbm would sulk here
A>>> d['ones'] = 11
A>>> d['a dict'] = {'this dictionary' : 'is arbitrary', 42 : 'the answer'}
B>>> d['ones'] #shelve would raise a KeyError here, unless A had called d.sync() and B had reloaded d
11 #pickle (with different syntax) would have returned 1 here, and then 11 on next call
(etc. for B)
I could achieve this behaviour by creating my own module that uses pickle, and editing the dump and load behaviour so that they use the repeated reads I mentioned above - but I find it hard to believe that this problem has never occurred to, and been fixed by, more talented programmers before. Moreover, these repeated reads seem inefficient to me (though I must admit that my knowledge of operation complexity is limited, and it's possible that these repeated reads are going on "behind the scenes" in otherwise apparently smoother modules like shelve). Therefore, I conclude that I must be missing some code module that would solve the problem for me. I'd be grateful if anyone could point me in the right direction, or give advice about implementation.
Use the ZODB (the Zope Object Database) instead. Backed with ZEO it fulfills your requirements:
Transparent persistence for Python objects
ZODB uses pickles underneath so anything that is pickle-able can be stored in a ZODB object store.
Full ACID-compatible transaction support (including savepoints)
This means changes from one process propagate to all the other processes when they are good and ready, and each process has a consistent view on the data throughout a transaction.
ZODB has been around for over a decade now, so you are right in surmising this problem has already been solved before. :-)
The ZODB lets you plug in storages; the most common format is FileStorage, which stores everything in one Data.fs file with an optional blob storage for large objects.
Some ZODB storages are wrappers around others to add functionality; DemoStorage for example keeps changes in memory to facilitate unit testing and demonstration setups (restart and you have clean slate again). BeforeStorage gives you a window in time, only returning data from transactions before a given point in time. The latter has been instrumental in recovering lost data for me.
ZEO is one such plugin; it introduces a client-server architecture. Using ZEO lets you access a given storage from multiple processes at a time; you won't need this layer if all you need is multi-threaded access from one process only.
The same could be achieved with RelStorage, which stores ZODB data in a relational database such as PostgreSQL, MySQL or Oracle.
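To give a feel for the API, here is a minimal single-process sketch (the file name and keys are illustrative, not part of the question):

import ZODB, ZODB.FileStorage
import transaction

storage = ZODB.FileStorage.FileStorage('a_file.fs')
db = ZODB.DB(storage)
connection = db.open()
root = connection.root()          # a persistent, dict-like root object

root['ones'] = 11
root['a dict'] = {'this dictionary': 'is arbitrary', 42: 'the answer'}
transaction.commit()              # changes become durable and visible here

db.close()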
For beginners, you can port your shelve databases to ZODB databases like this:
#!/usr/bin/env python
import shelve
import ZODB, ZODB.FileStorage
import transaction
from optparse import OptionParser
import os
import sys
import re

reload(sys)
sys.setdefaultencoding("utf-8")

parser = OptionParser()
parser.add_option("-o", "--output", dest="out_file", default=False, help="new zodb database filename")
parser.add_option("-i", "--input", dest="in_file", default=False, help="original shelve database filename")
parser.set_defaults()
options, args = parser.parse_args()

if options.in_file == False or options.out_file == False:
    print "Need input and output database filenames"
    exit(1)

db = shelve.open(options.in_file, writeback=True)
zstorage = ZODB.FileStorage.FileStorage(options.out_file)
zdb = ZODB.DB(zstorage)
zconnection = zdb.open()
newdb = zconnection.root()

for key, value in db.iteritems():
    print "Copying key: " + str(key)
    newdb[key] = value

transaction.commit()
I suggest using TinyDB; it's much, much better and simpler to use.
https://tinydb.readthedocs.io/en/stable/
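For completeness, a minimal TinyDB sketch (the file name and fields are illustrative):

from tinydb import TinyDB, Query

db = TinyDB('a_file.json')        # plain JSON file on disk
db.insert({'ones': 11, 'a dict': {'this dictionary': 'is arbitrary'}})

Record = Query()
print(db.search(Record.ones == 11))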
Most of this is background, skip the next 3 paragraphs for the question:
I have developed a tool that calls some installers, changes registry items, and moves files around to help me test a product which has a fairly fast update cycle. So far so good: I have a GUI which runs in a separate process from the business logic to prevent it locking due to the GIL, and everything works, etc. However, I have concerns about a section of my code where I make calls to msiexec.
Specifically, it's the uninstall part which gives me concerns. Currently the GUID does not change, so I am able to uninstall the product using an os.system('msiexec /x "{GUID}" /passive') sort of thing. It's actually a bit more complicated, as I'm using subprocess.Popen and polling it until it finishes from within an event loop to allow for concurrency with other steps.
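For illustration, that polling approach looks roughly like the following sketch ({GUID} stays a placeholder and the sleep interval is arbitrary):

import subprocess
import time

proc = subprocess.Popen('msiexec /x "{GUID}" /passive')
while proc.poll() is None:
    # ... keep servicing the event loop / run other concurrent steps ...
    time.sleep(0.5)
print 'msiexec exited with code %d' % proc.returncode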
My concern is that should the GUID change, obviously this will not work. I don't want to point msiexec directly at the installation source, as this would mean that it wouldn't work if I were to 'lose' the msi file, which I store in a temporary directory.
What I am looking for, is a way of querying by program name to get the GUID, or even a wrapper for msiexec that would do all of this, including the uninstall, for me. I thought of scanning through the registry, but the _winreg module seems very slow, so I'd prefer to avoid this if at all possible. If there's a better way to scan the registry, I'm all ears, as this would speed up other parts of the tool also.
Update0
Performance on this is critical as one of the design goals is to make the process which the tool follows faster than any other method, manual or otherwise, in order to gain wholesale adoption.
Update1
I have tried a slight variation of the registry version below; however, it consistently returns None. I'm not quite sure how this is happening - it seems like it is failing to open the appropriate key, as I have inserted a breakpoint after the with statement which is never reached...
def get_guid_by_name(name):
    from _winreg import (OpenKey,
                         QueryInfoKey,
                         EnumKey,
                         QueryValueEx,
                         HKEY_LOCAL_MACHINE,
                         )
    with OpenKey(HKEY_LOCAL_MACHINE,
                 r'SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall') as key:
        subkeys, _0, _1 = QueryInfoKey(key)  # The breakpoint here is never reached
        del _0, _1
        for i in range(subkeys):
            subkey = EnumKey(key, i)
            if subkey[0] != '{' or subkey[-1] != '}':
                continue
            with OpenKey(key, subkey) as _subkey:
                if name in QueryValueEx(_subkey, 'DisplayName')[0]:
                    return subkey
    return None

print get_guid_by_name('Microsoft Visual Studio')
Update2
Strike that - I'm a fool who doesn't check his indentation thoroughly enough - print get_guid_by_name('Microsoft Visual Studio') was within get_guid_by_name...
I'm not sure about the _winreg module being all that slow. I suppose if you were trying to enumerate the entire registry to find all instances of a string that might take a while, but with a decently targeted query it seems reasonably fast.
Here's an example:
from _winreg import *

def get_guid_by_name(name):
    # Open the uninstaller key
    with OpenKey(HKEY_LOCAL_MACHINE, r'Software\Microsoft\Windows\CurrentVersion\Uninstall') as key:
        # We only care about subkeys of the installer key
        subkeys, _, _ = QueryInfoKey(key)
        for i in range(subkeys):
            subkey = EnumKey(key, i)
            # Since we're looking for uninstallers for MSI products,
            # the key name will always be the GUID. We assume that any
            # key starting with '{' and ending with '}' is a GUID, but
            # if not the name won't match.
            if subkey[0] != '{' or subkey[-1] != '}':
                continue
            # Query the display name or other property of the key to
            # see if it's the one we want
            with OpenKey(key, subkey) as _subkey:
                if QueryValueEx(_subkey, 'DisplayName')[0] == name:
                    return subkey
    return None
On my machine, querying for ActiveState's Komodo Edit (I actually used a regular expression rather than a straight value comparison), 1000 iterations of this took 8.18 seconds (timed using timeit), which seems like a negligible amount of time to me. Better yet, you can pull the UninstallString value from the registry and pass that straight to your subprocess (though you may want to add the /passive switch to the end).
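A hedged sketch of that follow-up (the product name is a placeholder; for MSI products UninstallString is normally an msiexec command line, so /passive can simply be appended):

import subprocess
from _winreg import OpenKey, QueryValueEx, HKEY_LOCAL_MACHINE

guid = get_guid_by_name('ProductName')   # function from above; assumes a match was found
with OpenKey(HKEY_LOCAL_MACHINE,
             r'Software\Microsoft\Windows\CurrentVersion\Uninstall' + '\\' + guid) as key:
    uninstall_cmd = QueryValueEx(key, 'UninstallString')[0]

subprocess.call(uninstall_cmd + ' /passive')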
Edit
Microsoft does, of course, provide a WMI class (Win32_Product) that offers a rather convenient interface to do all of this. Using Tim Golden's excellent WMI wrapper, one could initiate an uninstall like this:
import wmi
c = wmi.WMI()
c.Win32_Product(Name = 'ProductName')[0].Uninstall()
However, as noted in this blog post, the Win32_Product class is extremely, painfully slow to use.