My task is to export an imported (compiled) module loaded from a container.
I have a Python script importing a module. Using print(module1) I can see that it is a compiled Python (.pyc) file loaded from an archive. As I cannot access the archive, my idea was to import the module and have it decompiled with uncompyle2.
This is my minimum code:
import os, sys
import uncompyle2
import module1

with open("module1.py", "wb") as fileobj:
    uncompyle2.uncompyle_file(module1, fileobj)
However, this prints an error. If I substitute module1 in the uncompyle_file argument with the actual path, it makes no difference. I tried the same snippet when the .pyc file was not loaded from a container but was a single file in a directory, and it worked.
Error:
Traceback (most recent call last):
File "C:\....\run.py", line 64, in <module>
uncompyle2.uncompyle_file(module1, fileobj)
File "C:\....\Python\python-2.7.6\lib\site-packages\uncompyle2\__init__.py", line 124, in uncompyle_file
version, co = _load_module(filename)
File "C:\.....\Python\python-2.7.6\lib\site-packages\uncompyle2\__init__.py", line 67, in _load_module
fp = open(filename, 'rb')
TypeError: coercing to Unicode: need string or buffer, module found
Does anyone know where I am going wrong?
You are going wrong with your initial assumption:
As I cannot access the archive, my idea was to import the module and
have it decompiled with uncompyle2.
Uncompiling an already loaded module is unfortunately not possible. A loaded Python module is not a mirror of the on-disk representation of a .pyc file. Instead, it is a collection of objects created as a side effect of executing the code in the .pyc. Once the code has been executed, its byte code is discarded and it (in the general case) cannot be reconstructed.
As an example, consider the following Python module:
import gtk
w = gtk.Window(gtk.WINDOW_TOPLEVEL)
w.add(gtk.Label("A quick brown fox jumped over the lazy dog"))
w.show_all()
Importing this module inside an application that happens to run a GTK main loop will pop up a window with some text as a side effect. The module will have a dict with two entries: gtk, pointing to the gtk module, and w, pointing to an already created GTK window. There is no hint there of how to create another such GTK window, nor how to create another such module. (Remember that the objects created might be arbitrarily complex and that their creation could be a very involved process.)
You might ask, then: if that is so, what is the content of the pyc file? How did it get loaded the first time? The answer is that the pyc file contains an on-disk rendition of the byte-compiled code of the module, ready for execution. Creating a pyc file is roughly equivalent to doing something like:
import marshal

def make_pyc(source_code, filename):
    # compile the source to a code object, then serialize it with
    # marshal (a real .pyc additionally starts with a small
    # magic-number/timestamp header)
    compiled = compile(source_code, filename, "exec")
    serialized = marshal.dumps(compiled)
    with open(filename, "wb") as out:
        out.write(serialized)

# for example:
make_pyc("import gtk\nw = gtk.Window(gtk.WINDOW_TOPLEVEL)...",
         "somefile.pyc")
On the other hand, loading a compiled module is approximately equivalent to:
import sys, marshal, imp

def load_pyc(modname):
    # (a real loader would first skip and validate the 8-byte
    # magic/timestamp header at the start of the .pyc)
    with open(modname + ".pyc", "rb") as in_:
        serialized = in_.read()
    compiled = marshal.loads(serialized)
    module = sys.modules[modname] = imp.new_module(modname)
    exec compiled in module.__dict__

load_pyc("somefile")
Note how, once the code has been executed with the exec statement, the string and the deserialized bytecode are no longer used and will be swept up by the garbage collector. The only remaining effect of the pyc having been loaded is the presence of a new module with living functions, classes, and other objects that are impossible to serialize, such as references to open files, network connections, OpenGL canvases, or GTK windows.
What modules like uncompyle2 do is the inverse of the compile function. You must have the actual code of the module (either serialized, as in a .pyc file, or as a deserialized code object, as in the compiled variable in the snippets above), from which uncompyle2 will produce a fairly faithful representation of the original source.
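A minimal sketch (standard library only; Python 2, to match the snippets above) of the objects a decompiler works with: compile goes from source to a code object, marshal round-trips the code object to the bytes stored in a .pyc, and a decompiler works backwards from the code object:

import marshal

src = "x = 1 + 2"
co = compile(src, "example.py", "exec")   # source -> code object
blob = marshal.dumps(co)                  # code object -> .pyc-style bytes
co_again = marshal.loads(blob)            # bytes -> code object
exec co_again in {}                       # runnable, but the source text is gone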
Pass the filename string first and then the file object to write to:
with open("out.txt","w") as f:
uncompyle2.uncompyle_file('path_to.pyc',f)
You can see the output:
with open("/home/padraic/test.pyc","rb") as f:
print(f.read())
with open("out.txt","r+") as f:
uncompyle2.uncompyle_file('/home/padraic/test.pyc',f)
f.seek(0)
print(f.read())
Output:
�
d�ZdS(cCs dGHdS(Nshello world((((stest.pytfoosN(R(((stest.pyt<module>s
#Embedded file name: test.py

def foo():
    print 'hello world'
Related
I have two Python scripts, let's call them parent.py and child.py. parent.py imports code from child.py.
I wanted to obfuscate the code against casual viewers (my audience), so I compiled them both to .pyc files, then changed the extensions back to .py. However, whenever I run parent.py, it says ValueError: source code string cannot contain null bytes.
I can fix this error by renaming child.py to child.pyc; however, I'd rather keep all files as .py. Does anyone know how I'd solve this?
You could move the '.py' extension from importlib.machinery.SOURCE_SUFFIXES to importlib.machinery.BYTECODE_SUFFIXES. This alone would not let you run the parent bytecode directly from python with a .py extension; instead, you could write a small driver containing Python code that gives no hint of the contents of your scripts:
from importlib.machinery import SourcelessFileLoader, SOURCE_SUFFIXES, BYTECODE_SUFFIXES
from importlib.util import spec_from_loader, module_from_spec
from os.path import abspath, dirname, join

# treat .py files as bytecode, not as source
SOURCE_SUFFIXES.remove('.py')
BYTECODE_SUFFIXES.append('.py')

# load parent.py (really bytecode) and execute it as __main__
path = dirname(abspath(__file__))
loader = SourcelessFileLoader('__main__', join(path, 'parent.py'))
spec = spec_from_loader(loader.name, loader)
script = module_from_spec(spec)
loader.exec_module(script)
You could call this driver script directly, assuming it was in the same folder as your bytecode.
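For completeness, the bytecode files the question describes can be produced with the standard py_compile module; the source file names here are hypothetical:

import py_compile

# compile the real sources to bytecode, saved under the .py names the driver expects
py_compile.compile('parent_src.py', cfile='parent.py')
py_compile.compile('child_src.py', cfile='child.py')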
I have a Python program that loads quite a bit of data before running. As such, I'd like to be able to reload code without reloading data. With regular Python, importlib.reload has been working fine. Here's an example:
setup.py:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

extensions = [
    Extension("foo.bar", ["foo/bar.pyx"],
              language="c++",
              extra_compile_args=["-std=c++11"],
              extra_link_args=["-std=c++11"])
]

setup(
    name="system2",
    ext_modules=cythonize(extensions, compiler_directives={'language_level': "3"}),
)
foo/bar.pyx:
cpdef say_hello():
    print('Hello!')
runner.py:
import pyximport
pyximport.install(reload_support=True)
import foo.bar
import subprocess
from importlib import reload

if __name__ == '__main__':

    def reload_bar():
        p = subprocess.Popen('python setup.py build_ext --inplace',
                             shell=True,
                             cwd='<your directory>')
        p.wait()
        reload(foo.bar)

    foo.bar.say_hello()
But this doesn't seem to work. If I edit bar.pyx and run reload_bar I don't see my changes. I also tried pyximport.build_module() with no luck -- the module rebuilt but didn't reload. I'm running in a "normal" python shell, not IPython if it makes a difference.
I was able to get a solution working for Python 2.x much more easily than for Python 3.x. For whatever reason, Cython seems to cache the shared object (.so) file it imports your module from, and even after rebuilding and deleting the old file while running, it still imports from the old shared object. However, that file isn't necessary anyway (when you import foo.bar, it doesn't create one), so we can just skip it.
The biggest problem was that Python kept a reference to the old module even after reloading. Normal Python modules seem to work fine, but not anything Cython-related. To fix this, I execute two statements in place of reload(foo.bar):
del sys.modules['foo.bar']
import foo.bar
This successfully (though probably less efficiently) reloads the Cython module. The only remaining issue is that in Python 3.x, running that subprocess creates a problematic shared object. So skip it altogether and let import foo.bar work its magic through the pyximport module, which recompiles for you. I also added an option to the pyximport.install call to specify the language level, to match what is specified in setup.py:
pyximport.install(reload_support=True, language_level=3)
So all together:
runner.py
import sys
import pyximport
pyximport.install(reload_support=True, language_level=3)
import foo.bar

if __name__ == '__main__':

    def reload_bar():
        del sys.modules['foo.bar']
        import foo.bar

    foo.bar.say_hello()
    input(" press enter to proceed ")
    reload_bar()
    foo.bar.say_hello()
The other two files remain unchanged.
Running:
Hello!
press enter to proceed
-replace "Hello!" in foo/bar.pyx with "Hello world!", and press Enter.
Hello world!
Cython extensions are not the usual Python modules, and thus the behavior of the underlying OS shows through. This answer is about Linux, but other OSes have similar behavior/problems (OK, Windows wouldn't even allow you to rebuild the extension while it is loaded).
A Cython extension is a shared object. When importing it, CPython opens this shared object via dlopen and calls the init function, i.e. PyInit_<module_name> in Python 3, which among other things registers the functions/functionality provided by the extension.
Once a shared object is loaded, we can no longer unload it, because there might be some Python objects still alive, which would then have dangling pointers instead of function pointers to the functionality of the original shared object. See for example this CPython issue.
Another important thing: when dlopen loads a shared object with the same path as an already loaded shared object, it will not read it from disk but will just reuse the already loaded version, even if there is a different version on disk.
And this is the problem with our approach: as long as the resulting shared object has the same name/path as the old one, you will never get to see the new functionality in the interpreter without restarting it.
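You can observe this dlopen caching at the Python level with ctypes; a small illustration (Linux-specific, and the library name libm.so.6 is an assumption for the example):

import ctypes

# dlopen the same path twice: the loader hands back the same handle
# instead of re-reading the file from disk
a = ctypes.CDLL("libm.so.6")
b = ctypes.CDLL("libm.so.6")
print(a._handle == b._handle)   # True: one shared object per path per process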
What are your options?
A: Use pyximport with reload_support=True
Let's assume your Cython (foo.pyx) module looks as follows:
def doit():
    print(42)

# called when loaded:
doit()
Now import it with pyximport:
>>> import pyximport
>>> pyximport.install(reload_support=True)
>>> import foo
42
>>> foo.doit()
42
foo.pyx was built and loaded (we can see it prints 42 while loading, as expected). Let's take a look at the file of foo:
>>> foo.__file__
'/home/XXX/.pyxbld/lib.linux-x86_64-3.6/foo.cpython-36m-x86_64-linux-gnu.so.reload1'
You can see the additional reload1 suffix, compared to the case built with reload_support=False. Seeing the file name, we also verify that there is no other foo.so lying somewhere in the path and being wrongly loaded.
Now, let's change 42 to 21 in foo.pyx and reload the file:
>>> import importlib
>>> importlib.reload(foo)
21
>>> foo.doit()
42
>>> foo.__file__
'/home/XXX/.pyxbld/lib.linux-x86_64-3.6/foo.cpython-36m-x86_64-linux-gnu.so.reload2'
What happened? pyximport built an extension with a different suffix (reload2) and loaded it. It succeeded because the name/path of the new extension is different due to the new suffix, and we can see 21 printed while loading.
However, foo.doit() is still the old version! If we look up the reload documentation, we see:

When reload() is executed: the Python module's code is recompiled and the module-level code re-executed, defining a new set of objects which are bound to names in the module's dictionary by reusing the loader which originally loaded the module. The init function of extension modules is not called a second time.
init (i.e. PyInit_<module_name>) isn't executed for extensions (that includes Cython extensions), thus PyModuleDef_Init with the foo module definition isn't called, and one is stuck with the old definition bound to foo.doit. This behavior is sane, because for some extensions the init function isn't supposed to be called twice.
To fix it we have to import the module foo once again:
>>> import foo
>>> foo.doit()
21
Now foo is reloaded as well as it gets, which means there might still be old objects in use. But I trust you to know what you are doing.
B: Change the name of your extensions with every version
Another strategy could be to build the module foo.pyx as foo_prefix1.so, then foo_prefix2.so, and so on, and to load it as
>>> import foo_prefixX as foo
This is the strategy used by the %%cython magic in IPython, which uses the sha1 hash of the Cython code as the prefix.
One can emulate IPython's approach using imp.load_dynamic (or its implementation with the help of importlib, since imp is deprecated):
from importlib._bootstrap import _load

def load_dynamic(name, path, file=None):
    """
    Load an extension module.
    """
    import importlib.machinery
    loader = importlib.machinery.ExtensionFileLoader(name, path)

    # Issue #24748: Skip the sys.modules check in _load_module_shim;
    # always load new extension
    spec = importlib.machinery.ModuleSpec(
        name=name, loader=loader, origin=path)
    return _load(spec)
Now, putting the .so files e.g. into different folders (or adding some suffix), so that dlopen sees them as different from the previous version, we can use it:
# first argument (name="foo") tells how the init-function
# of the extension (i.e. `PyInit_<module_name>`) is called
foo = load_dynamic("foo", "1/foo.cpython-37m-x86_64-linux-gnu.so")
# now foo has new functionality:
foo = load_dynamic("foo", "2/foo.cpython-37m-x86_64-linux-gnu.so")
Even if reloading in general, and reloading of extensions in particular, is kind of hacky, for prototyping purposes I would probably go with the pyximport solution... or use IPython and the %%cython magic.
I have a bytecode document that declares functions and a logo. I also have a .py file where I call the bytecode to output the logo and the strings in the functions. How do I go about actually executing the bytecode? I was able to disassemble it and see the assembly code. How can I actually run it?
question.py
import dis
import logo

def work_here():
    # execute the bytecode
    pass

def main():
    work_here()

if __name__ == '__main__':
    main()
Try something like:

import dis

code = 'some byte code'
b_code = dis.Bytecode(code)
exec(b_code.codeobj)
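A more complete, runnable sketch of the same idea, using compiled source as a stand-in for the questioner's bytecode document:

import dis

# build a code object, inspect its bytecode, then execute it
code_obj = compile("print('hello from bytecode')", '<logo>', 'exec')
bc = dis.Bytecode(code_obj)
print(bc.dis())     # the disassembly you were looking at
exec(bc.codeobj)    # actually run it: prints 'hello from bytecode'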
To import a .pyc file, you just do the same thing you do with a .py file: import spam will find an appropriately-placed spam.pyc (or rather, something like __pycache__/spam.cpython-36.pyc) just as it will find an appropriately-placed spam.py. Its top-level code gets run, any functions and classes get defined so you can call them, etc., exactly the same as with a .py file; the only difference is that there isn't source text to show for things like tracebacks or debugger stepping.
If you want to programmatically import a .pyc file by explicit path, or execute one without importing it, you again do the same thing you do with a .py file.
Look at the Examples in importlib. For example:
import importlib.util

path = 'bytecoderepo/myfile.pyc'
spec = importlib.util.spec_from_file_location('myfile', path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
And now, the code in bytecoderepo/myfile.pyc has been executed, and the resulting module is available in the variable mod, but it isn't in sys.modules or stored as a global.
If you actually need to dig into the .pyc format and, e.g., extract the bytecode of some function so you can exec it (or build a function object out of it) without executing the main module code, the details are only documented in the source and are subject to change between Python versions. Start with importlib; being able to (validate and) skip over the header and marshal.loads the body may be as far as you need to go, but probably not (since ultimately that's what the module loader already does for you in the sample code above, so if that's not good enough, you need to go deeper into the internals).
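A rough sketch of that header-plus-marshal step, assuming CPython 3.7+, where the .pyc header is 16 bytes (magic, flags, and two more 4-byte words); earlier versions use shorter headers:

import marshal

def top_level_code(pyc_path):
    with open(pyc_path, 'rb') as f:
        f.read(16)                      # skip the header (validate the magic in real code)
        return marshal.loads(f.read())  # the module's top-level code object

co = top_level_code('bytecoderepo/myfile.pyc')
exec(co)  # run the module body; nested code objects live in co.co_consts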
Is there a downside to importing the same built-in module in both my script and custom module?
I have a script that imports my custom module, and that also imports the built-in csv module to open a csv file and append any necessary content to a list.
I then have a method in my custom module to which I pass a path, a filename, and a list, and which writes a csv; but I have to import the csv module again (in my module).
I do not understand what happens when I import the csv module twice, so I wanted to know if there is a more uniform way of doing what I'm doing, or if this is OK.
No, there is no downside. Importing a module does two things:
If not yet in memory, load the module, storing the resulting object in sys.modules.
Bind names to either the module object (import modulename) or to attributes of the module object (from modulename import objectname).
Additional imports only execute step 2, as the module is already loaded.
See The import system in the Python reference documentation for the nitty gritty details:
The import statement combines two operations; it searches for the named module, then it binds the results of that search to a name in the local scope.
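A quick way to see both steps at the interpreter:

import sys
import csv
print('csv' in sys.modules)   # True: step 1 ran and cached the module
import csv as csv_again       # only step 2 runs: bind another name
print(csv_again is csv)       # True: both names refer to the same module object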
The short answer is no, there is no downside.
That being said, it may be helpful to understand what imports mean, particularly for anyone new to programming or coming from a different language background.
I imagine your code looks something like this:
# my_module.py
import os
import csv

def bar(path, filename, rows):
    full_path = os.path.join(path, filename)
    with open(full_path, 'w') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerows(rows)
and
# my_script.py
import csv
import my_module

def foo(path):
    contents = []
    with open(path, 'r') as f:
        csv_reader = csv.reader(f)
        for row in csv_reader:
            contents.append(row)
    return contents
As a high-level overview, when you do an import in this manner, Python determines whether the module has already been imported. If not, then it searches the Python path to determine where the imported module lives on the file system, then it loads the imported module's code into memory and executes it. The interpreter takes all objects that are created during the execution of the imported module and makes them attributes on a new module object that the interpreter creates. Then the interpreter stores this module object into a dictionary-like structure that maps the module name to the module object. Finally, the interpreter brings the imported module's name into the importing module's scope.
This has some interesting consequences. For example, it means that you could simply use my_module.csv to access the csv module within my_script.py. It also means that importing csv in both is trivial and is probably the clearest thing you can do.
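For instance, reaching csv through my_module works (data.csv is a hypothetical file):

import my_module

with open('data.csv') as f:
    reader = my_module.csv.reader(f)   # the csv module, via my_module's attribute
    print(next(reader))                # first row of the file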
One very interesting consequence is that if any statements that get executed during import have any side effects, those side effects will only happen when the module is first loaded by the interpreter. For example, suppose you had two modules a.py and b.py with the following code:
# a.py
print('hello world')
# b.py
print('goodbye world')
import a
If you run import a followed by import b then you will see
>>> import a
hello world
>>> import b
goodbye world
>>>
However, if you import in the opposite order, you get this:
>>> import b
goodbye world
hello world
>>> import a
>>>
Anyway, I think I've rambled enough and I hope I've adequately answered the question while giving some background. If this is at all interesting, I'd recommend Allison Kaptur's PyCon 2014 talk about import.
You can import the same module in separate files (custom modules), as far as I know. Python keeps track of already imported modules and knows how to resolve a second import.
Is there a way to import a Python module stored in a cStringIO data structure vs. physical disk file?
It looks like "imp.load_compiled(name, pathname[, file])" is what I need, but the description of this method (and similar methods) has the following disclaimer:
Quote: "The file argument is the byte-compiled code file, open for reading in binary mode, from the beginning. It must currently be a real file object, not a user-defined class emulating a file." [1]
I tried using a cStringIO object vs. a real file object, but the help documentation is correct - only a real file object can be used.
Any ideas on why these modules would impose such a restriction or is this just an historical artifact?
Are there any techniques I can use to avoid this physical file requirement?
Thanks,
Malcolm
[1] http://docs.python.org/library/imp.html#imp.load_module
Something like this perhaps?
import types
import sys

src = """
def hello(who):
    print 'hello', who
"""

def module_from_text(modulename, src):
    if modulename in sys.modules:
        module = sys.modules[modulename]
    else:
        module = sys.modules[modulename] = types.ModuleType(modulename)
    exec compile(src, '<no-file>', 'exec') in module.__dict__
    return module

module_from_text('flup', src)

import flup
flup.hello('world')
Which prints:
hello world
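The same trick works for byte-compiled code held in a cStringIO, which is closer to what the question asks. A sketch (Python 2, like the snippet above), assuming the buffer holds the marshalled top-level code object, i.e. the .pyc contents minus the 8-byte magic/timestamp header:

import marshal
import sys
import types

def module_from_pyc_buffer(modulename, buf):
    code = marshal.loads(buf.read())   # bytes -> code object
    module = sys.modules[modulename] = types.ModuleType(modulename)
    exec code in module.__dict__       # run the module body
    return module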
EDIT:
Evaluating code in this way treads near the realm of writing custom importers. It may be useful to look at PEP 302 and Doug Hellmann's PyMOTW: Modules and Imports.