Python bytecode compatibility

To what extent is Python bytecode compatible between releases?
I'm not talking about Python 2.x to Python 3.x, but, say, Python 3.3 to Python 3.4.
I'm not after it for security. I use Cython to convert the bulk of a program to C, but I do use a .pyc file as a means to store some constants; .pyc is preferable because it provides a file format that isn't easily changed unofficially. If someone wants something changed, they can request it via internal procedures.
Such a .pyc file only contains variables that are ints, floats, lists, dicts, and strings in standard Python, plus one class that acts more as a container/struct.
Is this a big no, or is this a try-and-see, given that only some very basic Python bytecode data is being stored?

Python makes no guarantee about bytecode compatibility between versions. Don't rely on it.
In fact, a .pyc file starts with a magic number that changes every time the marshalling code does, and Python checks this number for compatibility. Because that code changes in pretty much every release, the magic number does too. See Ned Batchelder's blog entry for details.
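You can see the version check for yourself: the running interpreter exposes its magic number, and it is the first four bytes of every .pyc it writes. A minimal sketch (the helper name and any path passed to it are just examples):

```python
import importlib.util

# The 4-byte magic number this interpreter stamps into its .pyc files.
print(importlib.util.MAGIC_NUMBER)

def pyc_matches_interpreter(path):
    """Return True if the .pyc at `path` was built by this interpreter version."""
    with open(path, "rb") as f:
        return f.read(4) == importlib.util.MAGIC_NUMBER
```

A .pyc compiled by a different Python version will fail this check, which is exactly why relying on cross-version bytecode compatibility is a bad idea.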
There are better ways of ensuring your files haven't been tampered with: checksums, for example.
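For tamper detection, a checksum is simpler and independent of the Python version. A sketch using only the standard library (the function name is made up; the path is whatever file you want to protect):

```python
import hashlib

def file_sha256(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```

Store the expected digest somewhere users can't casually edit, and compare it against the file at load time.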

Related

How to view the implementation of Python's built-in functions in PyCharm?

When I try to view the built-in function all() in PyCharm, all I see is "pass" in the function body. How can I view the actual implementation, so that I know exactly what the built-in function is doing?
def all(*args, **kwargs): # real signature unknown
    """
    Return True if bool(x) is True for all values x in the iterable.

    If the iterable is empty, return True.
    """
    pass
Assuming you’re using the usual CPython interpreter, all is a builtin function object, which just has a pointer to a compiled function statically linked into the interpreter (or libpython). Showing you the x86_64 machine code at that address probably wouldn’t be very useful to the vast majority of people.
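You can confirm this from Python itself: inspect can retrieve source for pure-Python functions, but not for a C-implemented builtin.

```python
import inspect
import textwrap

# A builtin like all() has no Python source attached to it...
try:
    inspect.getsource(all)
except TypeError as e:
    print("no source:", e)

# ...whereas a pure-Python stdlib function does.
print(inspect.getsource(textwrap.dedent)[:60])
```

This is the same distinction PyCharm runs into: for the builtin, all it can show you is a generated stub.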
Try running your code in PyPy instead of CPython. Many things that are builtins in CPython are plain old Python code in PyPy.1 Of course that isn’t always an option (e.g., PyPy doesn’t support 3.7 features yet, there are a few third-party extension modules that are still way too slow to use, it’s harder to build yourself if you’re on some uncommon platform…), so let’s go back to CPython.
The actual C source for that function isn’t too hard to find online. It’s in bltinmodule.c. But, unlike the source code to the Python modules in your standard library, you probably don’t have these files around. Even if you do have them, the only way to connect the binary to the source is through debugging output emitted when you compiled CPython from that source, which you probably didn’t do. But if you’re thinking that sounds like a great idea—it is. Build CPython yourself (you may want a Py_DEBUG build), and then you can just run it in your C source debugger/IDE and it can handle all the clumsy bits.
But if that sounds more scary than helpful, even though you can read basic C code and would like to find it…
How did I know where to find that code on GitHub? Well, I know where the repo is; I know the basic organization of the source into Python, Objects, Modules, etc.; I know how module names usually map to C source file names; I know that builtins is special in a few ways…
That’s all pretty simple stuff. Couldn’t you just program all that knowledge into a script, which you could then build a PyCharm plugin out of?
You can do the first 50% or so in a quick evening hack, and such things litter the shores of GitHub. But actually doing it right requires handling a ton of special cases, parsing some ugly C code, etc. And for anyone capable of writing such a thing, it’s easier to just lldb Python than to write it.
1. Also, even the things that are builtins are written in a mix of Python and a Python subset called RPython, which you might find easier to understand than C—then again, it’s often even harder to find that source, and the multiple levels that all look like Python can be hard to keep straight.

Does C have a "from-import"-like mechanism?

I've read here about importing a module in python. There is an option to not import a whole module (e.g. sys) and to only import a part of it (e.g. sys.argv). Is that possible in C? Can I include only the implementation of printf or any other function instead of the whole stdio.h library?
I ask this because it seems very inefficient to include a whole file where I need only several lines of code.
I understand that there is a possibility that including only the function itself won't work because it depends on other functions, other includes, defines, and globals. I only ask in order to use this for whole code blocks that contain all the data that are needed in order to execute.
C does not have anything that is equivalent to, or even similar to Python's "from ... import" mechanism.
I ask this because it seems very inefficient to include a whole file where I need only several lines of code.
Actually, what normally happens when you #include a file is that you import declarations: macros, and prototypes for functions defined somewhere else. You don't import any executable code, so the "unnecessary" inclusions have zero impact on runtime code size or efficiency.
If you use (i.e. "call") a macro, that causes the macro body to be expanded, which adds to the executable code size.
If you call a function whose declaration you have included, that adds the code for the call statement itself. The function body is not expanded inline, though. Instead, an "external reference" is added to your ".o" file, which the linker resolves when you create the executable from the ".o" files and the dependent libraries.
Python: "There is an option to not import a whole module". I think you misunderstand what is going on here. When you specify the names to import, it only means that those names go into your namespace. The whole module is compiled, and any code outside functions is run, even when you specify just one name.
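You can see this for yourself with a throwaway module whose top-level code has a visible side effect (the module name demo_mod is made up for the demo):

```python
import os
import sys
import tempfile
import textwrap

# Write a small module whose top-level code does something observable.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "demo_mod.py"), "w") as f:
    f.write(textwrap.dedent("""\
        print("demo_mod: top-level code ran")
        x = 1
        y = 2
    """))

sys.path.insert(0, tmpdir)
from demo_mod import x  # binds only x, but the whole module executed

print(x)
```

The print fires even though only x was imported, and sys.modules["demo_mod"] holds the fully executed module, y and all.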
C: I am going to assume that you are using an operating system like UNIX/Linux/OS X or Windows (the following does not apply to embedded systems).
The closest C has to import is dynamic runtime linking. That is not part of standard C; it is defined by the operating system, so POSIX has one mechanism and Windows has another. Most people call these library files "DLLs", but strictly speaking that is a Microsoft term; on UNIX-type systems they are "shared objects" (.so).
When a process attaches to a DLL or .so, it is "mapped" into the virtual memory of the process. The detail here varies between operating systems, but essentially the code is split into "pages", the size of which varies; 4 KB is typical, with larger pages (e.g. 16 KB) on some 64-bit systems. Only those pages that are required are loaded into memory. When a page is required, a so-called "page fault" occurs, and the operating system gets the page from either the executable file or the swap area (depending on the OS).
One of the advantages of this mechanism is that code pages can be shared between processes. So if you have 50 processes all using the same DLL (like the C run-time library, for example), then only one copy is actually loaded into memory. They all share the one set of pages (they can because they are read-only).
There is no sharing mechanism like that in Python - unless the module is itself written in C and is a DLL (.pyd).
All this occurs without the knowledge of the program.
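From Python you can poke at the same mechanism directly: ctypes asks the OS loader to map a shared library into the process at runtime and resolve a symbol from it. A POSIX-flavored sketch (the libc.so.6 fallback assumes a typical Linux system):

```python
import ctypes
import ctypes.util

# Ask the platform for the C runtime library's name, then map it in.
name = ctypes.util.find_library("c") or "libc.so.6"
libc = ctypes.CDLL(name)

# Call a function that lives in the shared object, not in our program.
print(libc.abs(-42))  # → 42
```

The pages of libc backing that call are shared with every other process using it, exactly as described above.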
EDIT: looking at others' answers, I realise you might be thinking of the #include pre-processor directive to merge a header file into the source code. Assuming these are standard header files, they make no difference to the size of your executable: they only contain information used by the pre-processor, compiler, or linker. If there are definitions in the header file that are not used, there should be no side effect.
Linking libraries (the -l option to the compiler) that are not used will make the executable larger, which makes the page tables larger, but aside from that, if they are not used they shouldn't make any significant difference. That is because of the on-demand page loading described above (a concept invented in the 1960s in Manchester, UK).

How come Python does not include a function to load a pickle from a file name?

I often include this, or something close to it, in Python scripts and IPython notebooks.
import cPickle

def unpickle(filename):
    with open(filename, 'rb') as f:  # binary mode: pickle data is bytes
        obj = cPickle.load(f)
    return obj
This seems like a common enough use case that the standard library should provide a function that does the same thing. Is there such a function? If there isn't, how come?
Most of the serialization libraries in the stdlib and on PyPI have a similar API. I'm pretty sure it was marshal that set the standard,* and pickle, json, PyYAML, etc. have just followed in its footsteps.
So, the question is, why was marshal designed that way?
Well, you obviously need loads/dumps; you couldn't build those on top of a filename-based function, and to build them on top of a file-object-based function you'd need StringIO, which didn't come until later.
You don't necessarily need load/dump, because those could be built on top of loads/dumps—but doing so could have major performance implications: you can't save anything to the file until you've built the whole thing in memory, and vice-versa, which could be a problem for huge objects.
You definitely don't need a loadf/dumpf function based on filenames, because those can be built trivially on top of load/dump, with no performance implications, and no tricky considerations that a user is likely to get wrong.
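For completeness, here is how trivially such filename-based wrappers fall out of load/dump; the names loadf and dumpf are this answer's invention, not a stdlib API:

```python
import pickle

def loadf(filename):
    """Unpickle and return the first object stored in the named file."""
    with open(filename, "rb") as f:
        return pickle.load(f)

def dumpf(obj, filename):
    """Pickle obj to the named file."""
    with open(filename, "wb") as f:
        pickle.dump(obj, f)
```

Two lines of body each, no performance cost, and nothing a user is likely to get wrong, which is exactly why the stdlib never felt compelled to provide them.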
On the one hand, it would be convenient to have them anyway—and there are some libraries, like ElementTree, that do have analogous functions. It may only save a few seconds and a few lines per project, but multiply that by thousands of projects…
On the other hand, it would make Python larger. Not so much the extra 1K to download and install it if you added these two functions to every module (although that did mean a lot more back in the 1.x days than nowadays…), but more to document, more to learn, more to remember. And of course more code to maintain—every time you need to fix a bug in marshal.dumpf you have to remember to go check pickle.dumpf and json.dumpf to make sure they don't need the change, and sometimes you won't remember.
Balancing those two considerations is really a judgment call. One someone made decades ago and probably nobody has really discussed since. If you think there's a good case for changing it today, you can always post a feature request on the issue tracker or start a thread on python-ideas.
* Not in the original 1991 version of marshal.c; that just had load and dump. Guido added loads and dumps in 1993 as part of a change whose main description was "Add separate main program for the Mac: macmain.c". Presumably because something inside the Python interpreter needed to dump and load to strings.**
** marshal is used as the underpinnings for things like importing .pyc files. This also means (at least in CPython) it's not just implemented in C, but statically built into the core of the interpreter itself. I think it actually could be turned into a regular module since the 3.4 import changes, but it definitely couldn't have been back in the early days. So, that's extra motivation to keep it small and simple.

Query on python execution model

Below is the program that defines a function within another function.
1) When we say python program.py, does every line of Python source directly get converted to a set of machine instructions that get executed on the processor?
2) The diagram above has a GlobalFrame, a LocalFrame, and Objects. In the above program, where do frames, objects, and code reside at runtime? Is there a separate memory space given to this program within the Python interpreter's virtual memory address space?
"Does every line of Python source directly get converted to a set of machine instructions that get executed on the processor?"
No. Python code (not necessarily by line) typically gets converted to an intermediate code which is then interpreted by what some call a "virtual machine" (confusingly, as VM means something completely different in other contexts, but ah well). CPython, the most popular implementation (which everybody thinks of as "python":-), uses its own bytecode and interpreter thereof. Jython uses Java bytecode and a JVM to run it. And so on. PyPy, perhaps the most interesting implementation, can emit almost any sort of resulting code, including machine code -- but it's far from a line by line process!-)
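You can inspect that intermediate code with the dis module from the standard library:

```python
import dis

def greet(name):
    return "Hello, " + name

# Show the bytecode CPython compiled the function to: interpreter
# instructions like LOAD_FAST and RETURN_VALUE, not machine code.
dis.dis(greet)
```

The exact instruction names vary between CPython versions (another reminder that bytecode is an internal detail), but what you see is always interpreter opcodes, never processor instructions.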
"Where do frames, objects, and code reside at runtime?"
On the "heap", as defined by the malloc, or equivalent, in the C programming language in the CPython implementation (or Java for Jython, etc, etc).
That is, whenever a new PyObject is made (in CPython's internals), a malloc or equivalent happens, and that object is forevermore referred to via a pointer (a PyObject*, in C syntax). Functions, frames, code objects, and so forth: almost everything is an object in Python -- no special treatment, "everything is first-class"!-)
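A quick way to see that functions, code objects, and even frames really are ordinary objects:

```python
import sys

def f():
    pass

print(type(f))                # the function itself is an object
print(type(f.__code__))       # so is its compiled code
print(type(sys._getframe()))  # and so is the currently executing frame
```

Each of those lives on the heap and is reference-counted like any other Python object.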

Linking and Loading in interpreted languages

In compiled languages, the source code is turned into object code by the compiler and the different object files (if there are multiple files) are linked by the linker and loaded into the memory by the loader for execution.
If I have an application written in an interpreted language (e.g., Ruby or Python) and the source code is split across files, when exactly are the files brought together? To put it in other words, when is the linking done? Do interpreted languages have linkers and loaders in the first place, or does the interpreter do everything?
I am really confused about this and not able to get my head around it! Can anyone shed some light on this?
An interpreted language is more or less a large configuration for an executable that is called the interpreter. That executable (e.g. /usr/bin/python) is the program which actually runs. It then reads the script it shall execute (e.g. /home/alfe/bin/factorial.py) and executes it, in the simplest form line-by-line.
During that process it can encounter references to other files (other modules, e.g. /usr/python/lib/math.py), and then it will read and interpret those.
Many such languages have mechanisms built in to reduce the overhead of this process by creating byte-code versions of the scripts they interpreted. So there might well be a file /usr/python/lib/math.pyc for instance, which the interpreter put there after first processing and which it can faster read and interpret than the original /usr/python/lib/math.py. But this is not really part of the concept of interpreted languages¹.
Sometimes, a binary library is part of an interpreted language; depending on the sophistication of the interpreter it can link that library at runtime and then use it. This is most typical for the system modules and stuff which needs to be highly optimized.
But in general one can say that no binary machine code gets generated at all. And nothing is linked at the compile time. Actually, there is no real compile time, even though one could call that first processing of the input scripts a compile step.
Footnotes:
¹) The concept of interpreting scripts encompasses neither that "compiling" (pre-translating the source into a faster-to-interpret form) nor that "caching" of this form by storing files like the .pyc files. With regard to your question concerning linking and splitting programs into several files or modules: these aspects of precompiling and caching are just technical details to speed things up. The concept itself is: read one line of the input script and execute it, then read the next line, and so on.
Well, in Python, modules are loaded and executed or parsed when the interpreter finds an import statement or some other indication to do so. There's no linking, but there is loading, of course (when the file is requested in the code).
Python does something clever to improve its performance: it compiles a file to bytecode (a .pyc file) the first time it executes it. This substantially improves the execution of the code the next time the module is imported or executed.
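The standard library exposes both halves of this: py_compile writes the bytecode file, and importlib.util tells you where the import system would cache it (the mymodule.py name here is just an example):

```python
import importlib.util

# Where the import system would cache bytecode for a given source file:
print(importlib.util.cache_from_source("mymodule.py"))
# e.g. __pycache__/mymodule.cpython-312.pyc (tag depends on the interpreter)

# py_compile.compile("mymodule.py") would write that file explicitly;
# a plain "import mymodule" does it automatically on first import.
```

The interpreter-version tag in the filename is what lets several Python versions cache bytecode for the same source file side by side.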
So the behavior is more or less:
A file is executed
Inside the file, the interpreter finds a reference to another file
It parses it and potentially executes it. This means that every class, variable, or method definition becomes available in the runtime.
And this is how the process is done (very general). Of course, there are optimizations and caches to improve the performance.
Hope this helps!
