Python3 compiled App Decrease Size with zip?

Python3 compiled App Decrease Size with zip? - python

I have compiled with success a rather big app ,but the size of the app is way too big(about 1 GB).I realized that almost the half (400MB) was taken from just 2 modules(PyQt5 and mpl_toolkits(basemap)).Of course the one approach is to delete the side-services you dont need but i cant in my case as i use about all of these in my app.
Then i was thinking about ziping these modules but i don't know if this is safe or execute time viable.
So is there a suficient way to minimize the size of my app, and if the ZIP is the only answer is it viable to do this on a lib such PyQt5 witch is imported many times in my script?

It is entirely possible to do this; Python has always supported "zipapps", and as of 3.5, provides a module, zipapp, to assist with creating them.
The only concern is whether your dependencies run properly zipped; most do, but some take file-system layer dependencies that zipping (and eggs and the like) don't handle (the setup.py file for these would declare zip_safe=False to tell tools not to allow it). Packages with C extensions would have to assert zip_safe=True, as the default analyzer assumes C extensions can't be safely zipped (though many can).
It will slow down startup a bit (since it has to decompress the modules), but it shouldn't dramatically slow runtime, no matter how many times you import it; the first import is the only one that really matters, since subsequent imports just pull the cached copy of the module from an internal cache (sys.modules).

Related

In my "small" python exe GUI program the tcl folder has 820 files (mostly tzdata). Any chance of reducing this number?

As stated in the title:; I have a "small" python exe GUI program generated by pyinstaller which creates a tcl folder that has 820 files (mostly tzdata). Any chance of reducing this number?
It takes a long time to copy the program because of all the tiny files.
I've used the datetime library. I just need the date and time to pop up on a pdf that I'm printing, so doesn't need to be that fancy. I just need the time on the computer :)
I can use "--onefile" to just get the .exe, but that takes too long to open.
Program is only for Windows atm.

You can almost certainly delete the http1.0 and opt0.4 directories outright. They're obsolete packages included for backward compatibility only.
The *.tcl and tclIndex files should be left (except for parray.tcl, which you likely don't need).
Of the encoding, msgs and tzdata directories, if you're deploying in a restricted set of locations, you can delete a lot of that; you only need the encodings, message catalogs and timezone definitions that you actually use when running. Thus, if you're only supporting English speakers in the USA, you can delete a very large fraction of the files. (If you're not using Tcl to format or parse dates at all, you don't need any timezone definitions.) The main encoding that you must retain is the one that the scripts are written in! (NB: support for the UTF-8 and ISO8859-1 encodings, and the UTF-16-derived ones used for talking to the Windows API, are all built in directly to Tcl; you can't remove support for them.)
Which things you can remove depend on your application and where you deploy it. That's why we can't tell you outright which files to delete.

Generally the 'blunt' approach is to attack the problem by deleting the files(or some files) and see if your programm works as intended without any bugs.That can be some times rather complicating and time consuming,and some times not even possible.
Libraries like pyinstaller and cx_freeze tends to be super inclusive of files that you don't even need so the programm is guaranteed to work.
Generally i advise you to create an installer for your programm(like Inno Setup) that will look really more professional and will diminish your current problem.
Also python supports ziped libraries that can drastically discreaze the size of the app on some libraries.Look one of my own question on the topic Python3 compiled App Decrease Size with zip?.
Have fun!

How to make Python ssl module use data in memory rather than pass file paths?

The full explanation of what I want to do and why would take a while to explain. Basically, I want to use a private SSL connection in a publicly distributed application, and not handout my private ssl keys, because that negates the purpose! I.e. I want secure remote database operations which no one can see into - inclusive of the client.
My core question is : How could I make the Python ssl module use data held in memory containing the ssl pem file contents instead of hard file system paths to them?
The constructor for class SSLSocket calls load_verify_locations(ca_certs) and load_cert_chain(certfile, keyfile) which I can't trace into because they are .pyd files. In those black boxes, I presume those files are read into memory. How might I short circuit the process and pass the data directly? (perhaps swapping out the .pyd?...)
Other thoughts I had were: I could use io.StringIO to create a virtual file, and then pass the file descriptor around. I've used that concept with classes that will take a descriptor rather than a path. Unfortunately, these classes aren't designed that way.
Or, maybe use a virtual file system / ram drive? That could be trouble though because I need this to be cross platform. Plus, that would probably negate what I'm trying to do if someone could access those paths from any external program...
I suppose I could keep them as real files, but "hide" them somewhere in the file system.
I can't be the first person to have this issue.
UPDATE
I found the source for the "black boxes"...
https://github.com/python/cpython/blob/master/Modules/_ssl.c
They work as expected. They just read the file contents from the paths, but you have to dig down into the C layer to get to this.
I can write in C, but I've never tried to recompile an underlying Python source. It looks like maybe I should follow the directions here https://devguide.python.org/ to pull the Python repo, and make changes to. I guess I can then submit my update to the Python community to see if they want to make a new standardized feature like I'm describing... Lots of work ahead it seems...

It took some effort, but I did, in fact, solve this in the manner I suggested. I revised the underlying code in the _ssl.c Python module / extension and rebuilt Python as a whole. After figuring out the process for building Python from source, I had to learn the details for how to pass variables between Python and C, and I needed to dig into guts of OpenSSL (over which the Python module is a wrapper).
Fortunately, OpenSSL already has functions for this exact purpose, so it was just a matter of swapping out the how Python is trying to pass file paths into the C, and instead bypassing the file reading process and jumping straight to the implementation of using the ca/cert/key data directly instead.
For the moment, I only did this for Windows. Since I'm ultimately creating a cross platform program, I'll have to repeat the build process for the other platforms I'll support - so that's a hassle. Consider how badly you want this, if you are going to pursue it yourself...
Note that when I rebuilt Python, I didn't use that as my actual Python installation. I just kept it off to the side.
One thing that was really nice about this process was that after that rebuild, all I needed to do was drop the single new _ssl.pyd into my working directory. With that file in place, I could pass my direct cert data. If I removed it, I could pass the normal file paths instead. It will use either the normal Python source, or implicitly use the override if the .pyd file is simply put in the program's directory.

How do I speed up repeated calls a ruby program (github's linguist) from python?

I'm using github's linguist to identify unknown source code files. Running this from the command line after a gem install github-linguist is insanely slow. I'm using python's subprocess module to make a command-line call on a stock Ubuntu 14 installation.
Running against an empty file: linguist __init__.py takes about 2 seconds (similar results for other files). I assume this is completely from the startup time of Ruby. As #MartinKonecny points out, it seems that it is the linguist program itself.
Is there some way to speed this process up -- or a way to bundle the calls together?

One possibility is to just adapt the linguist program (https://github.com/github/linguist/blob/master/bin/linguist) to take multiple paths on the command-line. It requires mucking with a bit of Ruby, sure, but it would make it possible to pass multiple files without the startup overhead of Linguist each time.
A script this simple could suffice:
require 'linguist/file_blob'
ARGV.each do |path|
blob = Linguist::FileBlob.new(path, Dir.pwd)
# print out blob.name, blob.language, blob.sloc, etc.
end

Is there a way to combine a python project codebase that spans across different files into one file?

The reason I want to this is I want to use the tool pyobfuscate to obfuscate my python code. Butpyobfuscate can only obfuscate one file.

I've answered your direct question separately, but let me offer a different solution to what I suspect you're actually trying to do:
Instead of shipping obfuscated source, just ship bytecode files. These are the .pyc files that get created, cached, and used automatically, but you can also create them manually by just using the compileall module in the standard library.
A .pyc file with its .py file missing can be imported just fine. It's not human-readable as-is. It can of course be decompiled into Python source, but the result is… basically the same result you get from running an obfuscater on the original source. So, it's slightly better than what you're trying to do, and a whole lot easier.
You can't compile your top-level script this way, but that's easy to work around. Just write a one-liner wrapper script that does nothing but import the real top-level script. If you have if __name__ == '__main__': code in there, you'll also need to move that to a function, and the wrapper becomes a two-liner that imports the module and calls the function… but that's as hard as it gets.) Alternatively, you could run pyobfuscator on just the top-level script, but really, there's no reason to do that.
In fact, many of the packager tools can optionally do all of this work for you automatically, except for writing the trivial top-level wrapper. For example, a default py2app build will stick compiled versions of your own modules, along with stdlib and site-packages modules you depend on, into a pythonXY.zip file in the app bundle, and set up the embedded interpreter to use that zipfile as its stdlib.

There are a definitely ways to turn a tree of modules into a single module. But it's not going to be trivial. The simplest thing I can think of is this:
First, you need a list of modules. This is easy to gather with the find command or a simple Python script that does an os.walk.
Then you need to use grep or Python re to get all of the import statements in each file, and use that to topologically sort the modules. If you only do absolute flat import foo statements at the top level, this is a trivial regex. If you also do absolute package imports, or from foo import bar (or from foo import *), or import at other levels, it's not much trickier. Relative package imports are a bit harder, but not that big of a deal. Of course if you do any dynamic importing, use the imp module, install import hooks, etc., you're out of luck here, but hopefully you don't.
Next you need to replace the actual import statements. With the same assumptions as above, this can be done with a simple sed or re.sub, something like import\s+(\w+) with \1 = sys.modules['\1'].
Now, for the hard part: you need to transform each module into something that creates an equivalent module object dynamically. This is the hard part. I think what you want to do is to escape the entire module code so that it can put into a triple-quoted string, then do this:
import types
mod_globals = {}
exec('''
# escaped version of original module source goes here
''', mod_globals)
mod = types.ModuleType(module_name)
mod.__dict__.update(mod_globals)
sys.modules[module_name] = mod
Now just concatenate all of those transformed modules together. The result will be almost equivalent to your original code, except that it's doing the equivalent of import foo; del foo for all of your modules (in dependency order) right at the start, so the startup time could be a little slower.

You can make a tool that:
Reads through your source files and puts all identifiers in a set.
Subtracts all identifiers from recursively searched standard- and third party modules from that set (modules, classes, functions, attributes, parameters).
Subtracts some explicitly excluded identifiers from that list as well, as they may be used in getattr/setattr/exec/eval
Replaces the remaining identifiers by gibberish
Or you can use this tool I wrote that does exactly that.
To obfuscate multiple files, use it as follows:
For safety, backup your source code and valuable data to an off-line medium.
Put a copy of opy_config.txt in the top directory of your project.
Adapt it to your needs according to the remarks in opy_config.txt.
This file only contains plain Python and is exec’ed, so you can do anything clever in it.
Open a command window, go to the top directory of your project and run opy.py from there.
If the top directory of your project is e.g. ../work/project1 then the obfuscation result will be in ../work/project1_opy.
Further adapt opy_config.txt until you’re satisfied with the result.
Type ‘opy ?’ or ‘python opy.py ?’ (without the quotes) on the command line to display a help text.

I think you can try using the find command with -exec option.
you can execute all python scripts in a directory with the following command.
find . -name "*.py" -exec python {} ';'
Wish this helps.
EDIT:
OH sorry I overlooked that if you obfuscate files seperately they may not run properly, because it renames function names to different names in different files.

Reliable way to get path to py file of a module

I'm trying to figure out the best way to reliably discover at runtime the location on the file system of the py file for a given module. I need to do this because I plan to externalize some configuration data about some methods (in this case, schemas to be used for validation of responses from service calls for which interfaces are defined in a module) for cleanliness and ease of maintenance.
An simplified illusutration of the system:
package
|
|-service.py
|
|-call1.scm
|
|-call2.scm
service.py (_call() is a method on the base class, though that's irrelevant to the question)
class FooServ(AbstractService):
def call1(*args):
result = self._call('/relative/uri/call1', *args)
# additional call specific processing
return result
def call2(*args):
result = self._call('/relative/uri/call2', *args)
# additional call specific processing
return result
call1.scm and call2.scm define the response schemas (in the current case, using the draft JSON schema format, though again, irrelevant to the question)
In another place in the codebase, when the service calls are actually made, I want to be able to detect the location of service.py so that I can traverse the file structure and find the scm files. At least on my system, I think that this will work:
# I realize this is contrived here, but in my code, the method is accessed this way
method = FooServ().call1
module_path = sys.modules[method.__self__.__class__.__module__].__file__
schema_path = os.path.join(os.path.dirname(module_path), method.__name__ + '.scm')
However, I wanted to make sure this would be safe on all platforms and installation configurations, and I came across this while doing research, which made me concerned that trying to do this this way will not work reliably. Will this work universally, or is the fact that __file__ on a module object returns the location of the pyc file, which could be in some location other than along side the py file, going to make this solution ineffective? If it will make it ineffective, what, if anything, can I do instead?

In the PEP you link, it says:
In Python 3, when you import a module, its __file__ attribute points to its source py file (in Python 2, it points to the pyc file).
So in Python 3 you're fine because __file__ will always point to the .py file. In Python 2 it might point to the .pyc file, but that .pyc will only ever be in the same directory as the .py file for Python 2.
Okay, I think you're referring to this bit:
Because these distributions cannot share pyc files, elaborate mechanisms have been developed to put the resulting pyc files in non-shared locations while the source code is still shared. Examples include the symlink-based Debian regimes python-support [8] and python-central [9]. These approaches make for much more complicated, fragile, inscrutable, and fragmented policies for delivering Python applications to a wide range of users.
I believe those mechanisms are applied only to Python modules that are packaged by the distribution. I don't think they should affect modules installed manually outside of the distribution's packaging system. It would be the responsibility of whoever was packaging your module for that distribution to make sure that the module isn't broken by the mechanism.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.