Is it easy to fully decompile python compiled(*.pyc) files?

Is it easy to fully decompile python compiled(*.pyc) files? - python

I wanted to know that how much easy is to decompile python byte code. I have made an application in python whose source I want to be secure. I am using py2exe which basically relies on python compiled files.
Does that secure the code?

Depends on your definition of "fully" (in "fully decompile")...;-). You won't easily get back the original Python source -- but getting the bytecode is easy, and standard library module dis exists exactly to make bytecode easily readable (though it's still bytecode, not full Python source code;-).

Compiling .pyc does not secure the code. They are easily read. See How do I protect Python code?

If you search online you can find decompilers for Python bytecode: there's a free version for downloading but which only handles bytecode up to Python 2.3, and an online service which will decompile up to version 2.6.
There don't appear to be any decompilers yet for more recent versions of Python bytecode, but that's almost certainly just because nobody has felt the need to write one rather than any fundamental difficulty with the bytecode itself.
Some people have tried to protect Python bytecode by modifying the interpreter: there's no particular reason why you can't compile your own interpreter with the different values used for the bytecode: that will prevent simple examination of the code with import dis, but won't stand up long to any determined attack and it all costs money that code be better put into improving the program itself.
In short, if you want to protect your program then use the law to do it: use an appropriate software license and prosecute those who ignore it. Code is expensive to write, but the end result is rarely the valuable part of a software package: data is much more valuable.

Now we can say: VERY EASY decompile, and NOT secure.
For I have use uncompyle6 to decompile my (latest 3.8.0) Python code:
uncompyle6 utils.cpython-38.pyc > utils.py
and decompile effect is NEARLY PERFECT.
origin python vs decompiled python:
-> ALMOST same py code except no comments
Not forget, other can use unpy2exe to
Extract .pyc files from executables created with py2exe
so your method not secure.

Related

Are syntax errors in Python found at 'compile time' or 'runtime'? [duplicate]

From my understanding:
An interpreted language is a high-level language run and executed by an interpreter (a program which converts the high-level language to machine code and then executing) on the go; it processes the program a little at a time.
A compiled language is a high-level language whose code is first converted to machine-code by a compiler (a program which converts the high-level language to machine code) and then executed by an executor (another program for running the code).
Correct me if my definitions are wrong.
Now coming back to Python, I am bit confused about this. Everywhere you learn that Python is an interpreted language, but it's interpreted to some intermediate code (like byte-code or IL) and not to the machine code. So which program then executes the IM code? Please help me understand how a Python script is handled and run.

First off, interpreted/compiled is not a property of the language but a property of the implementation. For most languages, most if not all implementations fall in one category, so one might save a few words saying the language is interpreted/compiled too, but it's still an important distinction, both because it aids understanding and because there are quite a few languages with usable implementations of both kinds (mostly in the realm of functional languages, see Haskell and ML). In addition, there are C interpreters and projects that attempt to compile a subset of Python to C or C++ code (and subsequently to machine code).
Second, compilation is not restricted to ahead-of-time compilation to native machine code. A compiler is, more generally, a program that converts a program in one programming language into a program in another programming language (arguably, you can even have a compiler with the same input and output language if significant transformations are applied). And JIT compilers compile to native machine code at runtime, which can give speed very close to or even better than ahead of time compilation (depending on the benchmark and the quality of the implementations compared).
But to stop nitpicking and answer the question you meant to ask: Practically (read: using a somewhat popular and mature implementation), Python is compiled. Not compiled to machine code ahead of time (i.e. "compiled" by the restricted and wrong, but alas common definition), "only" compiled to bytecode, but it's still compilation with at least some of the benefits. For example, the statement a = b.c() is compiled to a byte stream which, when "disassembled", looks somewhat like load 0 (b); load_str 'c'; get_attr; call_function 0; store 1 (a). This is a simplification, it's actually less readable and a bit more low-level - you can experiment with the standard library dis module and see what the real deal looks like. Interpreting this is faster than interpreting from a higher-level representation.
That bytecode is either interpreted (note that there's a difference, both in theory and in practical performance, between interpreting directly and first compiling to some intermediate representation and interpret that), as with the reference implementation (CPython), or both interpreted and compiled to optimized machine code at runtime, as with PyPy.

The CPU can only understand machine code indeed. For interpreted programs, the ultimate goal of an interpreter is to "interpret" the program code into machine code. However, usually a modern interpreted language does not interpret human code directly because it is too inefficient.
The Python interpreter first reads the human code and optimizes it to some intermediate code before interpreting it into machine code. That's why you always need another program to run a Python script, unlike in C++ where you can run the compiled executable of your code directly. For example, c:\Python27\python.exe or /usr/bin/python.

The answer depends on what implementation of python is being used. If you are using lets say CPython (The Standard implementation of python) or Jython (Targeted for integration with java programming language)it is first translated into bytecode, and depending on the implementation of python you are using, this bycode is directed to the corresponding virtual machine for interpretation. PVM (Python Virtual Machine) for CPython and JVM (Java Virtual Machine) for Jython.
But lets say you are using PyPy which is another standard CPython implementation. It would use a Just-In-Time Compiler.

Yes, it is both compiled and interpreted language. Then why we generally call it as interpreted language?
see how it is both- compiled and interpreted?
First of all I want to tell that you will like my answer more if you are from the Java world.
In the Java the source code first gets converted to the byte code through javac compiler then directed to the JVM(responsible for generating the native code for execution purpose). Now I want to show you that we call the Java as compiled language because we can see that it really compiles the source code and gives the .class file(nothing but bytecode) through:
javac Hello.java -------> produces Hello.class file
java Hello -------->Directing bytecode to JVM for execution purpose
The same thing happens with python i.e. first the source code gets converted to the bytecode through the compiler then directed to the PVM(responsible for generating the native code for execution purpose). Now I want to show you that we usually call the Python as an interpreted language because the compilation happens behind the scene
and when we run the python code through:
python Hello.py -------> directly excutes the code and we can see the output provied that code is syntactically correct
# python Hello.py it looks like it directly executes but really it first generates the bytecode that is interpreted by the interpreter to produce the native code for the execution purpose.
CPython- Takes the responsibility of both compilation and interpretation.
Look into the below lines if you need more detail:
As I mentioned that CPython compiles the source code but actual compilation happens with the help of cython then interpretation happens with the help of CPython
Now let's talk a little bit about the role of Just-In-Time compiler in Java and Python
In JVM the Java Interpreter exists which interprets the bytecode line by line to get the native machine code for execution purpose but when Java bytecode is executed by an interpreter, the execution will always be slower. So what is the solution? the solution is Just-In-Time compiler which produces the native code which can be executed much more quickly than that could be interpreted. Some JVM vendors use Java Interpreter and some use Just-In-Time compiler. Reference: click here
In python to get around the interpreter to achieve the fast execution use another python implementation(PyPy) instead of CPython.
click here for other implementation of python including PyPy.

According to the official Python site, it's interpreted.
https://www.python.org/doc/essays/blurb/
Python is an interpreted, object-oriented, high-level programming language...
...
Since there is no compilation step ...
...
The Python interpreter and the extensive standard library are available...
...
Instead, when the interpreter discovers an error, it raises an
exception. When the program doesn't catch the exception, the
interpreter prints a stack trace.

Its a big confusion for people who just started working in python and the answers here are a little difficult to comprehend so i'll make it easier.
When we instruct Python to run our script, there are a few steps that Python carries out before our code actually starts crunching away:
It is compiled to bytecode.
Then it is routed to virtual machine.
When we execute some source code, Python compiles it into byte code. Compilation is a translation step, and the byte code is a low-level platform-independent representation of source code.
Note that the Python byte code is not binary machine code (e.g., instructions for an Intel chip).
Actually, Python translate each statement of the source code into byte code instructions by decomposing them into individual steps. The byte code translation is performed to speed execution.
Byte code can be run much more quickly than the original source code statements. It has.pyc extension and it will be written if it can write to our machine.
So, next time we run the same program, Python will load the .pyc file and skip the compilation step unless it's been changed. Python automatically checks the timestamps of source and byte code files to know when it must recompile. If we resave the source code, byte code is automatically created again the next time the program is run.
If Python cannot write the byte code files to our machine, our program still works. The byte code is generated in memory and simply discarded on program exit. But because .pyc files speed startup time, we may want to make sure it has been written for larger programs.
Let's summarize what happens behind the scenes.
When Python executes a program, Python reads the .py into memory, and parses it in order to get a bytecode, then goes on to execute. For each module that is imported by the program, Python first checks to see whether there is a precompiled bytecode version, in a .pyo or .pyc, that has a timestamp which corresponds to its .py file. Python uses the bytecode version if any. Otherwise, it parses the module's .py file, saves it into a .pyc file, and uses the bytecode it just created.
Byte code files are also one way of shipping Python codes. Python will still run a program if all it can find are.pyc files, even if the original .py source files are not there.
Python Virtual Machine (PVM)
Once our program has been compiled into byte code, it is shipped off for execution to Python Virtual Machine (PVM). The PVM is not a separate program. It need not be installed by itself. Actually, the PVM is just a big loop that iterates through our byte code instruction, one by one, to carry out their operations. The PVM is the runtime engine of Python. It's always present as part of the Python system. It's the component that truly runs our scripts. Technically it's just the last step of what is called the Python interpreter.

If ( You know Java ) {
Python code is converted into bytecode like java does.
That bytecode is executed again everytime you try to access it.
} else {
Python code is initially traslated into something called bytecode that is quite
close to machine language but not actual machine code
so each time we access or run it that bytecode is executed again
}

It really depends on the implementation of the language being used! There is a common step in any implementation, though: your code is first compiled (translated) to intermediate code - something between your code and machine (binary) code - called bytecode (stored into .pyc files). Note that this is a one-time step that will not be repeated unless you modify your code.
And that bytecode is executed every time you are running the program. How? Well, when we run the program, this bytecode (inside a .pyc file) is passed as input to a Virtual Machine (VM)1 - the runtime engine allowing our programs to be executed - that executes it.
Depending on the language implementation, the VM will either interpret the bytecode (in the case of CPython2 implementation) or JIT-compile3 it (in the case of PyPy4 implementation).
Notes:
1 an emulation of a computer system
2 a bytecode interpreter; the reference implementation of the language, written in C and Python - most widely used
3 compilation that is being done during the execution of a program (at runtime)
4 a bytecode JIT compiler; an alternative implementation to CPython, written in RPython (Restricted Python) - often runs faster than CPython

Almost, we can say Python is interpreted language. But we are using some part of one time compilation process in python to convert complete source code into byte-code like java language.

For newbies
Python automatically compiles your script to compiled code, so called byte code, before running it.
Running a script is not considered an import and no .pyc will be created.
For example, if you have a script file abc.py that imports another module xyz.py, when you run abc.py, xyz.pyc will be created since xyz is imported, but no abc.pyc file will be created since abc.py isn’t being imported.

Python(the interpreter) is compiled.
Proof: It won't even compile your code if it contains syntax error.
Example 1:
print("This should print")
a = 9/0
Output:
This should print
Traceback (most recent call last):
File "p.py", line 2, in <module>
a = 9/0
ZeroDivisionError: integer division or modulo by zero
Code gets compiled successfully. First line gets executed (print) second line throws ZeroDivisionError (run time error) .
Example 2:
print("This should not print")
/0
Output:
File "p.py", line 2
/0
^
SyntaxError: invalid syntax
Conclusion: If your code file contains SyntaxError nothing will execute as compilation fails.

As sone one already said, "interpreted/compiled is not a property of the language but a property of the implementation." Python can be used in interpretation mode as well as compilation mode. When you run python code directly from terminal or cmd then the python interpreter starts. Now if you write any command then this command will be directly interpreted. If you use a file containing Python code and running it in IDE or using a command prompt it will be compiled first the whole code will be converted to byte code and then it will run. So it depends on how we use it.

The python code you write is compiled into python bytecode, which creates file with extension .pyc. If compiles, again question is, why not compiled language.
Note that this isn't compilation in the traditional sense of the word. Typically, we’d say that compilation is taking a high-level language and converting it to machine code. But it is a compilation of sorts. Compiled in to intermediate code not into machine code (Hope you got it Now).
Back to the execution process, your bytecode, present in pyc file, created in compilation step, is then executed by appropriate virtual machines, in our case, the CPython VM
The time-stamp (called as magic number) is used to validate whether .py file is changed or not, depending on that new pyc file is created. If pyc is of current code then it simply skips compilation step.

Seems to be a case of semantics. I think most of us infer that the usual result of compiling is machine-code. With that in mind I say to myself that python is not compiled. I would be wrong though because compile really means convert to a lower level so converting from source to byte-code is also compiling.

In my opinion Python is put into an interpreter categoty because its designed to be capable of fully processing ( from python code to execution in cpu) individual python statement. I.e. you write one statement and you can execute it and if no errors then get the corresponding result.
Having an intermediate code (like bytecode) i believe doesnt make difference to categorize it overall as compiler. Though this component (intermediate code generation) is typically been part of compiler but it can also be used in interpreters. See wiki definition of interpreter https://en.m.wikipedia.org/wiki/Interpreter_(computing). Its a crucial piece to add efficiency in terms of execution speed. With cache its even more powerful so that if you havent changed code in current program scope you skip heavy processing steps like lexical, semantic analysis and even some of code optimization.

How can I wrap Python processes in an OSGi deployment

I need to integrate a body of Python code into an existing OSGi (Apache Felix) deployment.
I assume, or at least hope, that packages exist to help with this effort.
If it helps, the Python code is still relatively new and small, so can probably re re-architected to meet whatever constraints are needed. However, it must remain in Python, because of dependencies on third-party libraries.
What are suggested best practices?

The trick is to make this an extender, see 1 and 2. You want your Python code to be separate from the code that handles the interaction with the interpreter. So what you do is wrap the Python code and any native libraries in a bundle. This is trivial since it is just a zip file.
You then develop a bundle that listens to starting bundle (see the BundleTracker) that have python code. A manifest is often used but you can also look in a directory in the JAR. If you detect this code, you extract any native libraries and run the code in the interpeter of your choice.
If can use JYthon then that would be highly recommended. You can then carry the interpreter as an OSGi bundle that runs on the VM. If you need to use a native compiler your life is less rosy. You can rely on the environment to provide you with an interpreter but then why use OSGi in the first place. You basically lose the write once run anywhere advantage. You could go the full monty by creating bundles that contain Python installers for all platforms you support. Can be done, not even that hard, but a maintenance nightmare. Believe me, native code suck, it only does it a bit faster than Java.

Running python script without installed libraries

I have working Python script using scipy and numpy functions and I need to run it on the computer with installed Python but without modules scipy and numpy. How should I do that? Is .pyc the answer or should I do something more complex?
Notes:
I don't want to use py2exe. I am aware of it but it doesn't fit to the problem.
I have read, these questions (What is the difference between .py and .pyc files?, Python pyc files (main file not compiled?)) with obvious connection to this problem but since I am a physicist, not a programmer, I got totally lost.

It is not possible.
A pyc-file is nothing more than a python file compiled into byte-code. It does not contain any modules that this file imports!
Additionally, the numpy module is an extension written in C (and some Python). A substantial piece of it are shared libraries that are loaded into Python at runtime. You need those for numpy to work!

Python first "compiles" a program into bytecode, and then throws this bytecode through an interpreter.
So if your code is all Python code, you would be able to one-time generate the bytecode and then have the Python runtime use this. In fact I've seen projects such as this, where the developer has just looked through the bytecode spec, and implemented a bytecode parsing engine. It's very lightweight, so it's useful for e.g. "Python on a chip" etc.
Problem comes with external libraries not entirely written in Python, (e.g. numpy, scipy).
Python provides a C-API, allowing you to create (using C/C++ code) objects that appear to it as Python objects. This is useful for speeding things up, interacting with hardware, making use of C/C++ libs.

Take a look at Nuitka. If you'll be able to compile your code (not necessarily a possible or easy task), you'll get what you want.

Python portability issues

Basically, I am a Java programmer who wants to learn Python language. I want to clarify why some of python libaries are distributing using non-portable manner.
Let me explain my thoughts. If someone creates a regular library using Java he prepares 1 (one) JAR file which can be used on different platforms:
my-great-lib-1.2.4.jar
I can use this lib (the same file) on any version of Windows or Linux.
In contrast to Java, python libraries may look like this:
bsdiff4-1.1.4.win-amd64-py2.5.exe
bsdiff4-1.1.4.win-amd64-py2.6.exe
bsdiff4-1.1.4.win-amd64-py2.7.exe
bsdiff4-1.1.4.win-amd64-py3.2.exe
bsdiff4-1.1.4.win-amd64-py3.3.exe
bsdiff4-1.1.4.win32-py2.5.exe
bsdiff4-1.1.4.win32-py2.6.exe
bsdiff4-1.1.4.win32-py2.7.exe
bsdiff4-1.1.4.win32-py3.2.exe
bsdiff4-1.1.4.win32-py3.3.exe
See full list on page.
It looks very strange for me. Even 32bit and 64bit platforms require different installers. Installers! Why do I need an installer in order to use one library? Moreover, outlined installers are only for Windows. Each of them is bind to particular python version. Where is portability?
Could anyone explain a necessity of 10 different files above?

In general, Python libraries are portable across platforms. Problems appear between different major Python versions (3 introduced some big changes from 2, but 2.7 is backwards compatible with 2.6) or when you use C code for optimizing CPU intensive code. On Linux, compiling it yourself is not a problem, when you call pip install package, it will do it for you. The problem is on Windows, where it is much more difficult to compile a C program, especially because not everybody has a compiler. So, for Windows, packages that need something in C, you usually get an installer.
Also, installers are used because they set up everything nicely, look in the registry for the appropriate place to put everything, offer a standard way to uninstall them (the ones from Chrisopther Goelke's site can be removed using Add/Remove programs in Control Panel) and because that's the standard on Windows: most of the programs on Windows are installed via an exe, because it doesn't have a standard and widespread package manager.
All these libraries are then portable: you can use them from any platform, but installing them is what differs.

There are many complications. In Java where your code and then byte-code is interpreted by JVM, the inherent computer architecture do not play lot of role as long as your code is interpreted well by JVM. In fact, that is one of the primary reason Java got so popular because your code should only worry about rightly compiled by JVM.
However, in Python situation is different. I am trying to summarize some of the reason which I think is important in following lines:
The language itself is evolving (although it is long in the scenario if you think!) and changes are happening inside the language. New features are added and sometime, even some remodeling of language is done ( Python 2.x to Python 3.x)
Python relies heavily on its C extensions and so does the applications written in Python. If you write a python program and have some CPU intensive code, you can choose to write it in C. This also adds in the necessity of creating number of libraries for various distribution.

For one python versions jump around. In python 3, the syntax of some builtins completely changed. For example:
raw_input()
changed to:
input()
also, a lot of the standard library has changed even in the alpha of 3.4. As for the 32/64 bit question, I cannot fully answer. I know that certain platforms have trouble when trying to run 32/64, and that may be the point there.

What are the limitations of distributing .pyc files?

I've started working on a commercial application in Python, and I'm weighing my options for how to distribute the application.
Aside from the obvious (distribute sources with an appropriate commercial license), I'm considering distributing just the .pyc files without their corresponding .py sources. But I'm not familiar enough with Python's compatibility guarantees to know if this is even workable, much less whether it's a good idea or not.
Are .pyc files independent of the underlying OS? For example, would a .pyc file generated on a 64-bit Linux machine work on a 32-bit Windows machine?
I've found that .pyc file should be compatible across bugfix releases, but what about major and minor releases? For example, would a file generated with Python 3.1.5 be compatible with Python 3.2.x? Or would a .pyc file generated with Python 2.7.3 be compatible with a Python 3.x release?
Edit:
Primarily, I may have to appease stakeholders who are uncomfortable distributing sources. Distributing .pyc's without sources may give them some level of comfort, since it would require the extra step of decompiling to get at the sources, even if that step is somewhat trivial. Just enough of a barrier to keep honest people honest.

For example, would a file generated with Python 3.1.5 be compatible with Python 3.2.x?
No.
Or would a .pyc file generated with Python 2.7.3 be compatible with a Python 3.x release?
Doubly no.
I'm considering distributing just the .pyc files without their corresponding .py sources.
Python bytecode is high-level and trivially decompilable.

You certainly could distribute the .pyc files only. As Cat mentioned, no it would not be compatible with different major version of Python. It might prevent some people from viewing the source code, but the .pyc files are very easy to decompile. Basically if you can compile it, you can decompile it.
You could use a binary packager like py2exe / py2app / freeze. I've never tried them but someone could still decompile them if they wanted to.

As Cat said, pyc files are not cross version safe. Though what you're trying to hide from the users determines what you need to do.
As for source code, there is no good way to hide Python source code in a distributed application. If you just trying to hide specific details you could pack those into a C extension -- which would be much harder to decompile.
So if you're worried about code use, put a license attached to the code for no-use or translate the sections you don't want stolen to a compiled language. If you just want code to not be obviously Python, you can create a binary executable that wraps the Python code (though doesn't hide the actual details if someone extracts them from the file).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.