Is there any way to protect my executable's resources, such as the .png files I used for button designs in my Python executable? For example, if someone messes with them, the executable should fail.
I mean something like zipping, so that the user cannot read or write the files but the program can.
You could protect them with a checksum (such as SHA-2): if a resource is changed, its checksum changes, and you can emit an error.
Another approach would be to load it from a blob embedded into the program as a byte array. This approach is worse, but it would help to prevent accidental tampering.
But: as soon as somebody with enough interest downloads your program, everything you do to protect your resources will fail.
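The checksum idea above could be sketched roughly as follows. This is a minimal illustration, not a tamper-proof scheme: the `expected` mapping of paths to digests is a hypothetical table you would record at build time, and anyone who can edit the program can edit that table too.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_resources(expected):
    """Return the paths whose digest differs from the recorded one.

    `expected` maps file paths to hex digests recorded at build time
    (a hypothetical table, not something the packaging tools provide).
    """
    bad = []
    for path, want in expected.items():
        if not Path(path).is_file() or sha256_of(path) != want:
            bad.append(path)
    return bad
```

At startup the program would call `verify_resources(...)` and refuse to continue if the returned list is not empty.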
This can get pretty complicated fast; what you're looking for is obfuscation.
You could go as simple as checking a SHA-1 checksum of each file you load to make sure it hasn't been altered, or as far as cryptographically encoding your source to prevent targeted reverse-engineering attacks.
I would recommend the following websites to read more about this:
https://pyob.oxyry.com/
https://docs.python.org/3/library/hashlib.html
But overall, this is a topic which is a bit too complex for a simple answer.
I have a set of files (compiled software) that I want to give a unique fingerprint before distribution. The idea is to write a script that:
Randomly generates a character sequence
Appends the character sequence to a file in the project
Stores the fingerprint in a database with the addressee
Distributes the software to the addressee
The requirements for the fingerprint process are that:
The fingerprint is difficult to detect (i.e. not stored in the file metadata or easily accessible areas)
The fingerprint does not corrupt the data of the file the sequence is added to
The fingerprint can be added to an executable or dll file
It's easy to read the fingerprint if you know where to look
Are there any open source solutions that are built for the purpose of fingerprinting files?
Storing information in the file without corrupting it, and in a way that is not easily detectable, is an exercise in steganography, and quite a hard one. This theoretical tool needs to be able to parse the executable structure and properly modify it, edit offsets if needed, or detect padding areas; basically, it has to do some of the work that the compiler does. I doubt that it exists or is reliable.
However, there are quite a few steganography tools that can store information in pictures by subtly changing the colors of the pixels, perhaps you can store your information in the icon of the exe file or any included asset.
Another way is to hide the data at compilation time, in optimization level of the performance-uncritical parts of the executable, so that compiler generates slightly different code, but the behavior is guaranteed to stay consistent. You can now use file hashes as your fingerprint.
Yet another way is to create an unused string inside some random function, mark it as volatile (or the analog in your language of choice) to prevent the compiler from optimizing it out of your program, and put something noticeable in it, like REPLACE_ME. Now you can open the compiled file, search for this string and replace it with the identifier that you have generated. If the identifier and the string are the same length, you can't damage your software.
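A minimal sketch of the placeholder replacement described above, assuming the placeholder occurs exactly once in the compiled file. The function name and its parameters are made up for illustration.

```python
def stamp_fingerprint(src_path, dst_path, placeholder, fingerprint):
    """Copy src_path to dst_path, replacing the placeholder with the fingerprint.

    Both arguments are bytes and must have the same length, so no offset
    inside the binary moves.
    """
    if len(placeholder) != len(fingerprint):
        raise ValueError("fingerprint must be exactly as long as the placeholder")
    with open(src_path, "rb") as f:
        data = f.read()
    if data.count(placeholder) != 1:
        raise ValueError("placeholder must occur exactly once in the file")
    with open(dst_path, "wb") as f:
        f.write(data.replace(placeholder, fingerprint))
```

If the binary is code-signed, the stamping has to happen before signing, since changing any byte invalidates the signature.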
Another, more subtle way is to create multiple different rephrasings of the same messages in your app and swap them in and out as a way to differentiate versions. If your programming language stores null-terminated strings then this is very easy, just make your strings in the code as long as the longest rephrasing. If your language stores length of the string then you have to dynamically recalculate it too.
Alternatively, if you are working with Unicode strings in your code, you can use similar-looking glyphs in some strings as a lower-effort version of the previous idea. Basically you are performing a homograph attack on your own strings. You can also use Unicode control characters (ZWJ, ZWNJ, etc.) that do not affect most languages and are invisible.
All of these schemes are easily discovered by diffing two different distributions of the software. The one with the different optimization levels could plausibly be written off as just different builds of the software, but a persistent attacker could still figure it out.
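The invisible-character variant might look like this. It is a sketch under the assumption that the identifier is a small integer and that the string is stored as Unicode; the bit assignment (ZWNJ for 0, ZWJ for 1) and the function names are arbitrary choices.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, used here for bit 0
ZWJ = "\u200d"   # zero-width joiner, used here for bit 1


def hide_id(message, ident, bits=16):
    """Append `ident` to `message` as a run of invisible characters."""
    if not 0 <= ident < 2 ** bits:
        raise ValueError("identifier does not fit in the given number of bits")
    encoded = "".join(
        ZWJ if (ident >> i) & 1 else ZWNJ for i in reversed(range(bits))
    )
    return message + encoded


def read_id(message):
    """Recover the identifier from the zero-width characters, or None if absent."""
    value = 0
    found = False
    for ch in message:
        if ch in (ZWNJ, ZWJ):
            value = (value << 1) | (ch == ZWJ)
            found = True
    return value if found else None
```

The stamped string renders the same as the original in most fonts, but it is longer, so it only fits schemes where the string length can change.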
Since you are talking about compiled software, an alternative solution could be to use an executable (binary) encryption tool. When you execute the file it asks for a password; if the password is correct, it is used to generate a key, and that key decrypts the program directly in memory. That way recipients won't be able to analyze the binary on disk, and even with the key it would be a lot more difficult to do so, much less modify it. You can put as many fingerprints as you like, as regular text strings, into the code and they will most likely stay there.
I have to archive a large amount of data off of CDs and DVDs, and I thought it was an interesting problem that people might have useful input on. Here's the setup:
The script will be running on multiple boxes on multiple platforms, so I thought python would be the best language to use. If the logic creates a bottleneck, any other language works.
We need to archive ~1000 CDs and ~500 DVDs, so speed is a critical issue
The data is very valuable, so verification would be useful
The discs are pretty old, so a lot of them will be hard or impossible to read
Right now, I was planning on using shutil.copytree to dump the files into a directory, and compare file trees and sizes. Maybe throw in a quick hash, although that will probably slow things down too much.
So my specific questions are:
What is the fastest way to copy files off a slow medium like CD/DVDs? (or does the method even matter)
Any suggestions of how to deal with potentially failing discs? How do you detect discs that have issues?
When you read file by file, you're seeking randomly around the disc, which is a lot slower than a bulk transfer of contiguous data. And, since the fastest CD drives are several dozen times slower than the slowest hard drives (and that's not even counting the speed hit for doing multiple reads on each bad sector for error correction), you want to get the data off the CD as soon as possible.
Also, of course, having an archive as a .iso file or similar means that, if you improve your software later, you can re-scan the filesystem without needing to dig out the CD again (which may have further degraded in storage).
Meanwhile, trying to recover damaged CDs, and damaged filesystems, is a lot more complicated than you'd expect.
So, here's what I'd do:
Block-copy the discs directly to .iso files (whether in Python, or with dd), and log all the ones that fail.
Hash the .iso files, not the filesystems. If you really need to hash the filesystems, keep in mind that the common optimization of compressing the data before hashing (that is, tar czf - | shasum instead of just tar cf - | shasum) usually slows things down, even for easily compressible data, but you might as well test it both ways on a couple of discs. If you need your verification to be legally useful, you may have to use a timestamped signature provided by an online service instead, in which case compressing probably will be worthwhile.
For each successful .iso file, mount it and use basic file copy operations (whether in Python, or with standard Unix tools), and again log all the ones that fail.
Get a free or commercial CD recovery tool like IsoBuster (not an endorsement, just the first one that came up in a search, although I have used it successfully before) and use it to manually recover all of the damaged discs.
You can do a lot of this work in parallel—when each block copy finishes, kick off the filesystem dump in the background while you're block-copying the next drive.
Finally, if you've got 1500 discs to recover, you might want to invest in a DVD jukebox or auto-loader. I'm guessing new ones are still pretty expensive, but there must be people out there selling older ones for a lot cheaper. (From a quick search online, the first thing that came up was $2500 new and $240 used…)
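The block-copy and hashing steps above could be combined in one pass, roughly like this. It assumes the drive is readable as a raw device file (for example something like /dev/sr0 on Linux; the path is an assumption and differs per platform), and the function name and block size are illustrative.

```python
import hashlib


def image_disc(device_path, iso_path, block_size=2048 * 512):
    """Copy a device (or file) to iso_path and return (sha256_hex, error).

    error is None on success, or the OSError that stopped the copy, so the
    caller can log the disc as failed and move on to the next one.
    """
    digest = hashlib.sha256()
    try:
        with open(device_path, "rb") as src, open(iso_path, "wb") as dst:
            while True:
                block = src.read(block_size)
                if not block:
                    break
                dst.write(block)
                digest.update(block)
    except OSError as exc:
        return digest.hexdigest(), exc
    return digest.hexdigest(), None
```

Hashing while copying avoids reading the image a second time. A disc that returns an error here would go on the list for manual recovery.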
Writing your own backup system is not fun. Have you considered looking at ready-to-use backup solutions? There are plenty, many free ones...
If you are still bound to write your own... Answering your specific questions:
With CD/DVD you first typically have to master the image (using a tool like mkisofs), then write image to the medium. There are tools that wrap both operations for you (genisofs I believe) but this is typically the process.
To verify the backup quality, you'll have to read back all written files (by mounting a newly written CD) and compare their checksums against those of the original files. In order to do incremental backups, you'll have to keep archives of checksums for each file you save (with backup date etc).
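A rough sketch of the read-back comparison, assuming both the original files and the mounted copy are reachable as directory trees. The function names are made up for illustration.

```python
import hashlib
import os


def tree_checksums(root):
    """Map each file's path relative to root to its SHA-256 hex digest."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            sums[os.path.relpath(full, root)] = digest.hexdigest()
    return sums


def compare_trees(original_root, copy_root):
    """Return (missing, changed): files absent from the copy, and files that differ."""
    original = tree_checksums(original_root)
    copy = tree_checksums(copy_root)
    missing = sorted(set(original) - set(copy))
    changed = sorted(p for p in original if p in copy and original[p] != copy[p])
    return missing, changed
```

Saving the dictionary returned by `tree_checksums` along with the backup date gives the checksum archive needed for incremental backups.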
In distributing my app, I'd like to prevent casual users from viewing my png files, playing my mp3s or reading/modifying the plain text files I use to load and store data. The text I guess could be binary pickled? What about the images/sounds? What do you do when distributing your app?
Assuming py2exe or py2app.
You can use zip files, but they'll be visible while the program is running; you could extract them to a run-time generated temporary directory with tempfile.mkdtemp(), but it still would not be difficult to track them down.
Another solution would be to use lightweight encryption, or even simple obfuscation (such as ROT13 for the text files, and a simple XOR cipher on the binary files). This will add some time to the execution of your program, so make sure to take that into account.
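Both obfuscations fit in a few lines. This is only a sketch: the key is whatever byte string you choose, and neither function provides real security, since the key ships inside the program.

```python
import codecs
from itertools import cycle


def xor_bytes(data, key):
    """XOR data with a repeating key; applying it twice restores the original."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def rot13(text):
    """Rotate ASCII letters by 13 places; applying it twice restores the text."""
    return codecs.encode(text, "rot13")
```

The same call both scrambles and restores, so the program can store `xor_bytes(png_bytes, key)` on disk and apply `xor_bytes` again after reading the file.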
You could archive those files and at runtime unarchive, use, then delete them:
Here is an article regarding Work with ZIP archives
Not a very strong protection method, but it will discourage hobby hackers
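One way to sketch the unarchive, use, then delete cycle is a context manager around tempfile.mkdtemp(). The function name and the `res_` prefix are made up for illustration.

```python
import shutil
import tempfile
import zipfile
from contextlib import contextmanager


@contextmanager
def unpacked_resources(archive_path):
    """Extract the archive to a fresh temporary directory and remove it on exit."""
    tmpdir = tempfile.mkdtemp(prefix="res_")
    try:
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(tmpdir)
        yield tmpdir
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)
```

Inside `with unpacked_resources("data.zip") as folder:` the files exist on disk under `folder`, where a curious user can still find them while the program runs.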
I am about to get a bunch of python scripts from an untrusted source.
I'd like to be sure that no part of the code can hurt my system, meaning:
(1) the code is not allowed to import ANY MODULE
(2) the code is not allowed to read or write any data, connect to the network etc
(the purpose of each script is to loop through a list, compute some data from input given to it and return the computed value)
before I execute such code, I'd like to have a script 'examine' it and make sure that there's nothing dangerous there that could hurt my system.
I thought of using the following approach: check that the word 'import' is not used (so we are guaranteed that no modules are imported)
yet, it would still be possible for the user (if desired) to write code to read/write files etc (say, using open).
Then here comes the question:
(1) where can I get a 'global' list of python methods (like open)?
(2) Is there some code that I could add to each script that is sent to me (at the top) that would make some 'global' methods invalid for that script (for example, any use of the keyword open would lead to an exception)?
I know that there are some solutions of python sandboxing. but please try to answer this question as I feel this is the more relevant approach for my needs.
EDIT: suppose that I make sure that no import is in the file, and that no possible hurtful methods (such as open, eval, etc) are in it. can I conclude that the file is SAFE? (can you think of any other 'dangerous' ways that built-in methods can be run?)
This point hasn't been made yet, and should be:
You are not going to be able to secure arbitrary Python code.
A VM is the way to go unless you want security issues up the wazoo.
You can still obfuscate import without using eval:
s = '__imp'
s += 'ort__'
f = globals()['__builtins__'].__dict__[s]
** BOOM **
Built-in functions.
Keywords.
Note that you'll need to do things like look for both "file" and "open", as both can open files.
Also, as others have noted, this isn't 100% certain to stop someone determined to insert malicious code.
An approach that should work better than string matching is to use the ast module: parse the Python code, do your whitelist filtering on the tree (e.g. allow only basic operations), then compile and run the tree.
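For the question of where to get the lists, the running interpreter can report both. A short sketch for Python 3, where `file` and `execfile` no longer exist as built-ins:

```python
import builtins
import keyword

# Names that are available in every module without an import.
builtin_names = sorted(name for name in dir(builtins) if not name.startswith("_"))

# Reserved words of the running interpreter.
keywords = list(keyword.kwlist)
```

The lists differ between Python versions, so they should be generated on the interpreter that will run the scripts.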
See this nice example by Andrew Dalke on manipulating ASTs.
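A minimal sketch of such a whitelist filter follows, for Python 3.8 or later. The set of allowed node types, the `SAFE_CALLS` names, and the convention that the script reads `data` and stores its answer in `result` are all assumptions made for illustration. It is not a security guarantee: a script that passes can still loop forever or exhaust memory, so the advice to use a VM still applies.

```python
import ast
import builtins

# Node types this sketch permits; everything else is rejected.
ALLOWED_NODES = (
    ast.Module, ast.Expr, ast.Assign, ast.AugAssign, ast.For, ast.While, ast.If,
    ast.Name, ast.Load, ast.Store, ast.Constant, ast.List, ast.Tuple, ast.Call,
    ast.BinOp, ast.UnaryOp, ast.BoolOp, ast.Compare,
    ast.operator, ast.unaryop, ast.boolop, ast.cmpop,
)

# The only functions a script may call (an arbitrary illustrative choice).
SAFE_CALLS = {"len", "sum", "min", "max", "range", "abs"}


def check_script(source):
    """Return a list of reasons the script is rejected (empty if it passes)."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ALLOWED_NODES):
            problems.append(f"disallowed syntax: {type(node).__name__}")
        elif isinstance(node, ast.Name) and node.id.startswith("_"):
            problems.append(f"disallowed name: {node.id}")
        elif isinstance(node, ast.Call) and not (
            isinstance(node.func, ast.Name) and node.func.id in SAFE_CALLS
        ):
            problems.append("disallowed call target")
    return problems


def run_script(source, data):
    """Run a checked script with `data` as input and return its `result` variable."""
    problems = check_script(source)
    if problems:
        raise ValueError("; ".join(problems))
    safe_builtins = {name: getattr(builtins, name) for name in SAFE_CALLS}
    namespace = {"__builtins__": safe_builtins, "data": data}
    exec(compile(source, "<untrusted>", "exec"), namespace)
    return namespace.get("result")
```

Because attribute access (`ast.Attribute`) and imports are not in the whitelist, a statement such as `import os` or a call chain through `__reduce__` is rejected before anything runs.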
built in functions/keywords:
eval
exec
__import__
open
file
input
execfile
print can be dangerous if you have one of those dumb shells that execute code on seeing certain output
stdin
__builtins__
globals() and locals() must be blocked otherwise they can be used to bypass your rules
There's probably tons of others that I didn't think about.
Unfortunately, crap like this is possible...
object().__reduce__()[0].__globals__["__builtins__"]["eval"]("open('/tmp/l0l0l0l0l0l0l','w').write('pwnd')")
So it turns out that restricting keywords, imports, and the symbols that are in scope by default is not enough on its own; you need to verify the entire object graph...
Use a Virtual Machine instead of running it on a system that you are concerned about.
Without a sandboxed environment, it is impossible to prevent a Python file from doing harm to your system aside from not running it.
It is easy to create a Cryptominer, delete/encrypt/overwrite files, run shell commands, and do general harm to your system.
If you are on Linux, you should be able to use docker to sandbox your code.
For more information, see this GitHub issue: https://github.com/raxod502/python-in-a-box/issues/2.
I did come across this on GitHub, so something like it could be used, but that has a lot of limits.
Another approach would be to create another Python file which parses the original one, removes the bad code, and runs the file. However, that would still be hit-and-miss.
The title could probably have been put better, but anyway. I was wondering if there are any functions for writing to files that behave like the ACID properties do for databases. The reason is that I would like to make sure that the file writes I am doing won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: a dhcp leases file) and you're on a POSIX system, you can write your data to a temporary file and 'rename' the temporary file to your target. On POSIX-compliant systems this is guaranteed to be an atomic operation; it shouldn't even matter whether the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. But the key concept here is that the temporary file protects your data: if the system reboots, the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of a dhcp.leases file this isn't a big performance hit; larger files might prove to be more cumbersome.
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has its own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
A couple other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but it might not be flushed to disk yet. Rename protects you in this case, sqlite3 does as well.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia though. But if it's an embedded system and it dies 'I might lose my job if this fails' is also a good rationalization for the extra protection.
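The write-to-temporary-then-rename pattern might be sketched like this in Python. The function name is made up, and the atomicity of the final step relies on the temporary file living in the same directory (and therefore the same filesystem) as the target.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the contents of path with data (bytes) via a temporary file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new contents to disk before renaming
        os.replace(tmp_path, path)  # atomic on POSIX when both are on one filesystem
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A reader that opens `path` at any moment sees either the complete old contents or the complete new contents, never a half-written file.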
The ZODB is an ACID compliant database storage written in (mostly) python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliance. For example, you can define 'records' in the file you write and, when opening/reading, verify which records have been fully written (which may mean you need to throw away some partially written data). ZODB, for example, implements this by ending a record by writing the size of the record itself; if you can read this size and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.
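A simplified sketch of such an append-only record format follows. It is not ZODB's actual file layout, only an illustration of the size-trailer idea; the 4-byte big-endian length and the function names are assumptions.

```python
import struct

_LEN = struct.Struct(">I")  # 4-byte big-endian length


def append_record(f, payload):
    """Append one record: length, payload, then the length again as a trailer."""
    f.write(_LEN.pack(len(payload)) + payload + _LEN.pack(len(payload)))


def read_records(f):
    """Return all complete records, stopping at the first truncated or damaged one."""
    records = []
    while True:
        header = f.read(_LEN.size)
        if len(header) < _LEN.size:
            break
        (size,) = _LEN.unpack(header)
        payload = f.read(size)
        trailer = f.read(_LEN.size)
        if len(payload) < size or trailer != header:
            break
        records.append(payload)
    return records
```

If power fails in the middle of a write, the last record lacks a matching trailer and is ignored on the next read, while all earlier records remain usable.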
It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There are a couple of things to consider when doing this:
Ensure that your data is written to disk when you close a file. Even if you close it, some of the data may be in OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed with os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use a file format that helps you to verify the integrity of the data (e.g. use checksums).
Another alternative is to use sqlite3.
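A small sketch of the sqlite3 route, using the connection as a context manager so that a group of statements is committed together or not at all. The `leases` table and the function name are hypothetical.

```python
import sqlite3


def save_leases(conn, leases):
    """Replace all rows in the (hypothetical) leases table in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM leases")
        conn.executemany("INSERT INTO leases (mac, ip) VALUES (?, ?)", leases)
```

If anything fails between the DELETE and the last INSERT, the previous rows are still there after the rollback.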
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".