Disable code inspection with Cython and compilation

Disable code inspection with Cython and compilation - python

I have a python script which I have transformed into a pyd file using Cython and Gcc. I would like it to be a black box, so that people cannot inspect it easily.
The source code after cythonization and compilation is hidden. However, it seems like you can still inspect it quite easily when you import the module in a python console and use the dir magic method. Is there a way to prevent code inspection from happening?

The first thing to emphasise is that Cython isn't really designed to help you hide your code - it's more an unintentional consequence. If Cython could get a 1% speed-up by making the source code for a compiled module completely visible then it'd probably choose to do that rather than hide it (however that's pretty unlikely... I can't see a mechanism that'd happen by).
Having some things visible by dir is pretty unavoidable. Cython is trying to make things behave like Python as far as possible and Python works by looking names up in a module dictionary. All dir does is print the keys from the module dictionary. However there are some things you can try:
use cdef functions rather than def functions. cdef functions are only not callable (or visible) from Python. They have some limitations though - you can't use *args or **kwds as arguments for example. Some sub-notes:
you will probably need one entry-point that can be called from Python.
if you need to share cdef functions between modules then you will need to create a pxd file. shared cdef functions will appear in the __pyx_capi__ special attribute of your module where their name, C signature, and possibly function pointer will be able to anyone that looks.
Turning off the binding and embedsignature compiler directives will make regular def functions a little less easily introspected.
You can tag a cdef class with the cython.internal decorator. This makes it only easily accessible from within your module. Notes:
cdef classes aren't quite a substitute for Python classes (for example, all attributes must be pre-declared at compile-time),
If you return one of these classes to Python then the user can introspect it,
The class will leave some evidence that it exists in the names automatically generated pickle functions (visible in your module). You can turn that off with the auto_pickle compiler directive.

Related

How can I save a dynamically generated module and reimport them from file?

I have an application that dynamically generates a lot of Python modules with class factories to eliminate a lot of redundant boilerplate that makes the code hard to debug across similar implementations and it works well except that the dynamic generation of the classes across the modules (hundreds of them) takes more time to load than simply importing from a file. So I would like to find a way to save the modules to a file after generation (unless reset) then load from those files to cut down on bootstrap time for the platform.
Does anyone know how I can save/export auto-generated Python modules to a file for re-import later. I already know that pickling and exporting as a JSON object won't work because they make use of thread locks and other dynamic state variables and the classes must be defined before they can be pickled. I need to save the actual class definitions, not instances. The classes are defined with the type() function.
If you have ideas of knowledge on how to do this I would really appreciate your input.

You’re basically asking how to write a compiler whose input is a module object and whose output is a .pyc file. (One plausible strategy is of course to generate a .py and then byte-compile that in the usual fashion; the following could even be adapted to do so.) It’s fairly easy to do this for simple cases: the .pyc format is very simple (but note the comments there), and the marshal module does all of the heavy lifting for it. One point of warning that might be obvious: if you’ve already evaluated, say, os.getcwd() when you generate the code, that’s not at all the same as evaluating it when loading it in a new process.
The “only” other task is constructing the code objects for the module and each class: this requires concatenating a large number of boring values from the dis module, and will fail if any object encountered is non-trivial. These might be global/static variables/constants or default argument values: if you can alter your generator to produce modules directly, you can probably wrap all of these (along with anything else you want to defer) in function calls by compiling something like
my_global=(lambda: open(os.devnull,'w'))()
so that you actually emit the function and then a call to it. If you can’t so alter it, you’ll have to have rules to recognize values that need to be constructed in this fashion so that you can replace them with such calls.
Another detail that may be important is closures: if your generator uses local functions/classes, you’ll need to create the cell objects, perhaps via “fake” closures of your own:
def cell(x): return (lambda: x).__closure__[0]

Choose Python classes to instantiate at runtime based on either user input or on command line parameters

I am starting a new Python project that is supposed to run both sequentially and in parallel. However, because the behavior is entirely different, running in parallel would require a completely different set of classes than those used when running sequentially. But there is so much overlap between the two codes that it makes sense to have a unified code and defer the parallel/sequential behavior to a certain group of classes.
Coming from a C++ world, I would let the user set a Parallel or Serial class in the main file and use that as a template parameter to instantiate other classes at runtime. In Python there is no compilation time so I'm looking for the most Pythonic way to accomplish this. Ideally, it would be great that the code determines whether the user is running sequentially or in parallel to select the classes automatically. So if the user runs mpirun -np 4 python __main__.py the code should behave entirely different than when the user calls just python __main__.py. Somehow it makes no sense to me to have if statements to determine the type of an object at runtime, there has to be a much more elegant way to do this. In short, I would like to avoid:
if isintance(a, Parallel):
m = ParallelObject()
elif ifinstance(a, Serial):
m = SerialObject()
I've been reading about this, and it seems I can use factories (which somewhat have this conditional statement buried in the implementation). Yet, using factories for this problem is not an option because I would have to create too many factories.
In fact, it would be great if I can just "mimic" C++'s behavior here and somehow use Parallel/Serial classes to choose classes properly. Is this even possible in Python? If so, what's the most Pythonic way to do this?
Another idea would be to detect whether the user is running in parallel or sequentially and then load the appropriate module (either from a parallel or sequential folder) with the appropriate classes. For instance, I could have the user type in the main script:
from myPackage.parallel import *
or
from myPackage.serial import *
and then have the parallel or serial folders import all shared modules. This would allow me to keep all classes that differentiate parallel/serial behavior with the same names. This seems to be the best option so far, but I'm concerned about what would happen when I'm running py.test because some test files will load parallel modules and some other test files would load the serial modules. Would testing work with this setup?

You may want to check how a similar issue is solved in the stdlib: https://github.com/python/cpython/blob/master/Lib/os.py - it's not a 100% match to your own problem, nor the only possible solution FWIW, but you can safely assume this to be a rather "pythonic" solution.
wrt/ the "automagic" thing depending on execution context, if you decide to go for it, by all means make sure that 1/ both implementations can still be explicitely imported (like os.ntpath and os.posixpath) so they are truly unit-testable, and 2/ the user can still manually force the choice.
EDIT:
So if I understand it correctly, this file you points out imports modules depending on (...)
What it "depends on" is actually mostly irrelevant (in this case it's a builtin name because the target OS is known when the runtime is compiled, but this could be an environment variable, a command line argument, a value in a config file etc). The point was about both conditional import of modules with same API but different implementations while still providing direct explicit access to those modules.
So in a similar way, I could let the user type from myPackage.parallel import * and then in myPackage/init.py I could import all the required modules for the parallel calculation. Is this what you suggest?
Not exactly. I posted this as an example of conditional imports mostly, and eventually as a way to build a "bridge" module that can automagically select the appropriate implementation at runtime (on which basis it does so is up to you).
The point is that the end user should be able to either explicitely select an implementation (by explicitely importing the right submodule - serial or parallel and using it directly) OR - still explicitely - ask the system to select one or the other depending on the context.
So you'd have myPackage.serial and myPackage.parallel (just as they are now), and an additional myPackage.automagic that dynamically selects either serial or parallel. The "recommended" choice would then be to use the "automagic" module so the same code can be run either serial or parallel without the user having to care about it, but with still the ability to force using one or the other where it makes sense.
My fear is that py.test will have modules from parallel and serial while testing different files and create a mess
Why and how would this happen ? Remember that Python has no "process-global" namespace - "globals" are really "module-level" only - and that python's import is absolutely nothing like C/C++ includes.
import loads a module object (can be built directly from python source code, or from compiled C code, or even dynamically created - remember, at runtime a module is an object, instance of the module type) and binds this object (or attributes of this object) into the enclosing scope. Also, modules are garanteed (with a couple caveats, but those are to be considered as error cases) to be imported only once for a given process (and then cached) so importing the same module twice in a same process will yield the same object (IOW a module is a singleton).
All this means that given something like
# module A
def foo():
return bar(42)
def bar(x):
return x * 2
and
# module B
def foo():
return bar(33)
def bar(x):
return x / 2
It's garanteed that however you import from A and B, A.foo will ALWAYS call A.bar and NEVER call B.bar and B.foo will only ever call B.bar (unless you explicitely monkeyptach them of course but that's not the point).
Also, this means that within a module you cannot have access to the importing namespace (the module or function that's importing your module), so you cannot have a module depending on "global" names set by the importer.
To make a long story short, you really need to forget about C++ and learn how Python works, as those are wildly different languages with wildly different object models, execution models and idioms. A couple interesting reads are http://effbot.org/zone/import-confusion.htm and https://nedbatchelder.com/text/names.html
EDIT 2:
(about the 'automagic' module)
I would do that based on whether the user runs mpirun or just python. However, it seems it's not possible (see for instance this or this) in a portable way without a hack. Any ideas in that direction?
I've never ever had anything to do with mpi so I can't help with this - but if the general consensus is that there's no reliable portable way to detect this then obviously there's your answer.
This being said, simple stupid solutions are sometimes overlooked. In your case, explicitly setting an environment variable or passing a command-line switch to your main script would JustWork(tm), ie the user should for example use
SOMEFLAG=serial python main.py
vs
SOMEFLAG=parallel mpirun -np4 python main.py
or
python main.py serial
vs
mpirun -np4 python main.py parallel
(whichever works best for you needs - is the most easily portable).
This of course requires a bit more documentation and some more effort from the end-user but well...

I'm not really what you're asking here. Python classes are just (callable/instantiable) objects themselves, so you can of course select and use them conditionally. If multiple classes within multiple modules are involved, you can also make the imports conditional.
if user_says_parallel:
from myPackage.parallel import ParallelObject
ObjectClass = ParallelObject
else:
from myPackage.serial import SerialObject
ObjectClass = SerialObject
my_abstract_object = ObjectClass()
If that's very useful depends on your classes and the effort it takes to make sure they have the same API so they're compatible when replacing each other. Maybe even inheritance à la ParallelObject => SerialObject is possible, or at least a common (virtual) base class to put all the shared code. But that's just the same as in C++.

Dynamically create a new type in python

I am writing C++ plugins that exposes 'properties' that may be of various types. In short, a property in this context is a variable, together with some metadata. The properties are mainly holding values of simple types like, ints, doubles etc, but can also be user defined types/structures.
In C++, clients of the plugins can get a pointer to a property, and then manipulate it.
From python, on the other hand, simple wrapper functions can return a handle to a propertys value, e.g.
aValue = ctypes.c_double()
getPropertyValue(handleToAProperty, ctypes.byref(aValue))
which works fine for POD's.
But how to deal with a user defined type?
E.g. if a user defined type is:
class myType
{
private:
double a;
public:
void setA(double a);
double getA();
}
Is there any way to dynamically create something(??) on the client side that allows the client to retrieve and set the value of the member variable 'a' in the class above?
I could write a simple wrapper class, I assume but would be happy if one could get around that.

The class information, and not even the public methods information, is automatically included in a dynamically compiled C or C++ file.
The C/C++ compilers know how to validate the public functions in a compiled file, and even do parameter/return value type checking, because they verify this information at compile time in the C header files.
Even the Boost wrapper builders have to import the C/C++ header files and are able to access class members, etc... due to information fetched at compile time.
So, it can be possible to be able to findout C/C++ data structures and method signatures dynamically in a running Python program: but you have to keep in mind that ordinarily this information is not even in the library files. So, any approach you try will have to be able to scan the headerclass definition c++ files at program runtime, so that it can perform the needed introspection.
Knowing that, now, you have to search about C++/C header files parsing modules for Python, and check if any of them suit your needs - this home page of one such projects do list other similar projects in the introductory documentation, so you may find one that suits you:
https://github.com/albertz/PyCParser (note , though, that only C is supported).
Another technology that automatically exposes C classes (sic - it is an OO framework to pure C) to be used in other languages, Python included, is gobject/gobject introspection. It should work fine with Python3, but I don't have any idea if it a) works with C++ classes, b) allow the introspection without you having to modify your classes to use gobject stuff itself.

a libray in BOOST call Boost.Python has boost::python::object::ptr, which returns a PyObject* -- so that Python can operate c++ class, is this what you want?

How does eclipse's pydev do code completion?

Does anyone know how pydev determines what to use for code completion? I'm trying to define a set of classes specifically to enable code completion. I've tried using __new__ to set __dict__ and also __slots__, but neither seems to get listed in pydev autocomplete.
I've got a set of enums I want to list in autocomplete, but I'd like to set them in a generator, not hardcode them all for each class.
So rather than
class TypeA(object):
ValOk = 1
ValSomethingSpecificToThisClassWentWrong = 4
def __call__(self):
return 42
I'd like do something like
def TYPE_GEN(name, val, enums={}):
def call(self):
return val
dct = {}
dct["__call__"] = call
dct['__slots__'] = enums.keys()
for k, v in enums.items():
dct[k] = v
return type(name, (), dct)
TypeA = TYPE_GEN("TypeA",42,{"ValOk":1,"ValSomethingSpecificToThisClassWentWrong":4})
What can I do to help the processing out?
edit:
The comments seem to be about questioning what I am doing. Again, a big part of what I'm after is code completion. I'm using python binding to a protocol to talk to various microcontrollers. Each parameter I can change (there are hundreds) has a name conceptually, but over the protocol I need to use its ID, which is effectively random. Many of the parameters accept values that are conceptually named, but are again represented by integers. Thus the enum.
I'm trying to autogenerate a python module for the library, so the group can specify what they want to change using the names instead of the error prone numbers. The __call__ property will return the id of the parameter, the enums are the allowable values for the parameter.
Yes, I can generate the verbose version of each class. One line for each type seemed clearer to me, since the point is autocomplete, not viewing these classes.

Ok, as pointed, your code is too dynamic for this... PyDev will only analyze your own code statically (i.e.: code that lives inside your project).
Still, there are some alternatives there:
Option 1:
You can force PyDev to analyze code that's in your library (i.e.: in site-packages) dynamically, in which case it could get that information dynamically through a shell.
To do that, you'd have to create a module in site-packages and in your interpreter configuration you'd need to add it to the 'forced builtins'. See: http://pydev.org/manual_101_interpreter.html for details on that.
Option 2:
Another option would be putting it into your predefined completions (but in this case it also needs to be in the interpreter configuration, not in your code -- and you'd have to make the completions explicit there anyways). See the link above for how to do this too.
Option 3:
Generate the actual code. I believe that Cog (http://nedbatchelder.com/code/cog/) is the best alternative for this as you can write python code to output the contents of the file and you can later change the code/rerun cog to update what's needed (if you want proper completions without having to put your code as it was a library in PyDev, I believe that'd be the best alternative -- and you'd be able to grasp better what you have as your structure would be explicit there).
Note that cog also works if you're in other languages such as Java/C++, etc. So, it's something I'd recommend adding to your tool set regardless of this particular issue.

Fully general code completion for Python isn't actually possible in an "offline" editor (as opposed to in an interactive Python shell).
The reason is that Python is too dynamic; basically anything can change at any time. If I type TypeA.Val and ask for completions, the system had to know what object TypeA is bound to, what its class is, and what the attributes of both are. All 3 of those facts can change (and do; TypeA starts undefined and is only bound to an object at some specific point during program execution).
So the system would have to know st what point in the program run do you want the completions from? And even if there were some unambiguous way of specifying that, there's no general way to know what the state of everything in the program is like at that point without actually running it to that point, which you probably don't want your editor to do!
So what pydev does instead is guess, when it's pretty obvious. If you have a class block in a module foo defining class Bar, then it's a safe bet that the name Bar imported from foo is going to refer to that class. And so you know something about what names are accessible under Bar., or on an object created by obj = Bar(). Sure, the program could be rebinding foo.Bar (or altering its set of attributes) at runtime, or could be run in an environment where import foo is hitting some other file. But that sort of thing happens rarely, and the completions are useful in the common case.
What that means though is that you basically lose completions whenever you use "too much" of Python's dynamic language flexibility. Defining a class by calling a function is one of those cases. It's not ready to guess that TypeA has names ValOk and ValSomethingSpecificToThisClassWentWrong; after all, there's presumably lots of other objects that result from calls to TYPE_GEN, but they all have different names.
So if your main goal is to have completions, I think you'll have to make it easy for pydev and write these classes out in full. Of course, you could use similar code to generate the python files (textually) if you wanted. It looks though like there's actually more "syntactic overhead" of defining these with dictionaries than as a class, though; you're writing "a": b, per item rather than a = b. Unless you can generate these more systematically or parse existing definition files or something, I think I'd find the static class definition easier to read and write than the dictionary driving TYPE_GEN.

The simpler your code, the more likely completion is to work. Would it be reasonable to have this as a separate tool that generates Python code files containing the class definitions like you have above? This would essentially be the best of both worlds. You could even put the name/value pairs in a JSON or INI file or what have you, eliminating the clutter of the methods call among the name/value pairs. The only downside is needing to run the tool to regenerate the code files when the codes change, but at least that's an automated, simple process.
Personally, I would just go with making things more verbose and writing out the classes manually, but that's just my opinion.
On a side note, I don't see much benefit in making the classes callable vs. just having an id class variable. Both require knowing what to type: TypeA() vs TypeA.id. If you want to prevent instantiation, I think throwing an exception in __init__ would be a bit more clear about your intentions.

What are the advantages and disadvantages of the require vs. import methods of loading code?

Ruby uses require, Python uses import. They're substantially different models, and while I'm more used to the require model, I can see a few places where I think I like import more. I'm curious what things people find particularly easy — or more interestingly, harder than they should be — with each of these models.
In particular, if you were writing a new programming language, how would you design a code-loading mechanism? Which "pros" and "cons" would weigh most heavily on your design choice?

The Python import has a major feature in that it ties two things together -- how to find the import and under what namespace to include it.
This creates very explicit code:
import xml.sax
This specifies where to find the code we want to use, by the rules of the Python search path.
At the same time, all objects that we want to access live under this exact namespace, for example xml.sax.ContentHandler.
I regard this as an advantage to Ruby's require. require 'xml' might in fact make objects inside the namespace XML or any other namespace available in the module, without this being directly evident from the require line.
If xml.sax.ContentHandler is too long, you may specify a different name when importing:
import xml.sax as X
And it is now avalable under X.ContentHandler.
This way Python requires you to explicitly build the namespace of each module. Python namespaces are thus very "physical", and I'll explain what I mean:
By default, only names directly defined in the module are available in its namespace: functions, classes and so.
To add to a module's namespace, you explicitly import the names you wish to add, placing them (by reference) "physically" in the current module.
For example, if we have the small Python package "process" with internal submodules machine and interface, and we wish to present this as one convenient namespace directly under the package name, this is and example of what we could write in the "package definition" file process/__init__.py:
from process.interface import *
from process.machine import Machine, HelperMachine
Thus we lift up what would normally be accessible as process.machine.Machine up to process.Machine. And we add all names from process.interface to process namespace, in a very explicit fashion.
The advantages of Python's import that I wrote about were simply two:
Clear what you include when using import
Explicit how you modify your own module's namespace (for the program or for others to import)

A nice property of require is that it is actually a method defined in Kernel. Thus you can override it and implement your own packaging system for Ruby, which is what e.g. Rubygems does!
PS: I am not selling monkey patching here, but the fact that Ruby's package system can be rewritten by the user (even to work like python's system). When you write a new programming language, you cannot get everything right. Thus if your import mechanism is fully extensible (into totally all directions) from within the language, you do your future users the best service. A language that is not fully extensible from within itself is an evolutionary dead-end. I'd say this is one of the things Matz got right with Ruby.

Python's import provides a very explicit kind of namespace: the namespace is the path, you don't have to look into files to know what namespace they do their definitions in, and your file is not cluttered with namespace definitions. This makes the namespace scheme of an application simple and fast to understand (just look at the source tree), and avoids simple mistakes like mistyping a namespace declaration.
A nice side effect is every file has its own private namespace, so you don't have to worry about conflicts when naming things.
Sometimes namespaces can get annoying too, having things like some.module.far.far.away.TheClass() everywhere can quickly make your code very long and boring to type. In these cases you can import ... from ... and inject bits of another namespace in the current one. If the injection causes a conflict with the module you are importing in, you can simply rename the thing you imported: from some.other.module import Bar as BarFromOtherModule.
Python is still vulnerable to problems like circular imports, but it's the application design more than the language that has to be blamed in these cases.
So python took C++ namespace and #include and largely extended on it. On the other hand I don't see in which way ruby's module and require add anything new to these, and you have the exact same horrible problems like global namespace cluttering.

Disclaimer, I am by no means a Python expert.
The biggest advantage I see to require over import is simply that you don't have to worry about understanding the mapping between namespaces and file paths. It's obvious: it's just a standard file path.
I really like the emphasis on namespacing that import has, but can't help but wonder if this particular approach isn't too inflexible. As far as I can tell, the only means of controlling a module's naming in Python is by altering the filename of the module being imported or using an as rename. Additionally, with explicit namespacing, you have a means by which you can refer to something by its fully-qualified identifier, but with implicit namespacing, you have no means to do this inside the module itself, and that can lead to potential ambiguities that are difficult to resolve without renaming.
i.e., in foo.py:
class Bar:
def myself(self):
return foo.Bar
This fails with:
Traceback (most recent call last):
File "", line 1, in ?
File "foo.py", line 3, in myself
return foo.Bar
NameError: global name 'foo' is not defined
Both implementations use a list of locations to search from, which strikes me as a critically important component, regardless of the model you choose.
What if a code-loading mechanism like require was used, but the language simply didn't have a global namespace? i.e., everything, everywhere must be namespaced, but the developer has full control over which namespace the class is defined in, and that namespace declaration occurs explicitly in the code rather than via the filename. Alternatively, defining something in the global namespace generates a warning. Is that a best-of-both-worlds approach, or is there an obvious downside to it that I'm missing?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.