use binary data after main program in Python [duplicate]

I have seen some installation files for Unix-like systems (huge ones: install.sh for Matlab or Mathematica, for example); they must embed quite a lot of binary data, such as icons, sound, and graphics, into the script. I am wondering how that can be done, since this could be useful for simplifying file structure.
I am particularly interested in doing this with Python and/or Bash.
Existing methods that I know of in Python:
Just use a byte string: x = b'\x23\xa3\xef' ... This is terribly inefficient; it takes nearly half a MB of source for a 100KB wav file.
base64: better than option 1, but it still enlarges the size by a factor of 4/3.
I am wondering if there are other (better) ways to do this?

You can use base64 + compression (using bz2 for instance) if that suits your data (e.g., if you're not embedding already compressed data).
For instance, to create your data (say your data consist of 100 null bytes followed by 200 bytes with value 0x01):
>>> import bz2, base64
>>> base64.b64encode(bz2.compress(b'\x00' * 100 + b'\x01' * 200)).decode('ascii')
'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
And to use it (in your script) to write the data to a file:
import bz2
import base64

data = 'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
with open('/tmp/testfile', 'wb') as fdesc:
    fdesc.write(bz2.decompress(base64.b64decode(data)))

Here's a quick and dirty way. Create the following script called myInstaller:
#!/bin/bash
dd if="$0" of=payload bs=1 skip=54
exit
Then append your binary to the script, and make it executable:
cat myBinary >> myInstaller
chmod +x myInstaller
When you run the script, it will copy the binary portion to a new file specified in the path of=. This could be a tar file or whatever, so you can do additional processing (unarchiving, setting execute permissions, etc) after the dd command. Just adjust the number in "skip" to reflect the total length of the script before the binary data starts.
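Since the question also asks about Python, here is a hedged sketch of the same idea in Python: append the payload after a marker line and let the script locate it in its own source at run time. The marker text and output filename below are made up for illustration; to build the installer you would append a line containing the marker, then the binary data, to the end of the script.
#!/usr/bin/env python
import sys

# Written with escape sequences so the real byte pattern (newline, marker,
# newline) only occurs where the appended payload actually starts.
MARKER = b'\n#__PAYLOAD_BELOW__\n'

with open(sys.argv[0], 'rb') as fdesc:
    contents = fdesc.read()

payload = contents.split(MARKER, 1)[1]   # everything after the marker line
with open('payload.bin', 'wb') as out:
    out.write(payload)
This replaces the skip arithmetic of the dd approach with a marker search, so the script length no longer matters.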

Related

Randomized-offset binary raw disk writes with no caching whatsoever

For my application, I am attempting to determine whether a data backup system missed any writes. I am doing this by writing an incrementing integer counter to a 1GB virtual disk; to make sure no writes were missed, I can look at the reverted snapshot and see if there are any gaps (i.e. if I see 1, 2, 3, 0, 0, 6, 7 I know that the backup didn't get writes 4 and 5 correctly). This is all on a CentOS 7 VM, with mostly Python 2.7 scripts for writes/reads (speed isn't a huge concern).
A big part of my problem has been caching: since I'm simulating random I/O, writes are often flushed from caches and written to disk out of order. This makes every test appear as a false positive, since it looks like some data is missing at the time of the snapshot. Again, I don't really care about efficiency at all, so I don't mind really slow writes. Reads can use caching; that's not a problem, but it also doesn't matter much one way or the other.
Here are the things I have done to try to disable caching:
disabling the disk write cache with sudo hdparm -W 0 /dev/sdb, where /dev/sdb is the virtual disk being written to
writing to a raw disk with no filesystem, so no filesystem caching
setting the buffering argument of open() in the Python script to 0 (no Python write cache)
Is it basically an impossible task to make sure that my writes get put on the disk in sequential order? All I need is write #(n) to happen before write #(n+1), and #(n+1) before #(n+2), etc.
This is the Python script I'm using to write to disk (SIZE and PRIME change based on the size of the disk and a random seed):
from struct import pack, unpack
import sys

SIZE, PRIME = [x], [x]

# random I/O traversal iterator
def rand_index_generator(a, b):
    ctr = 0
    while True:
        yield (ctr % b)
        ctr += a

with open('/dev/sdb', 'rb+', buffering=0) as f:
    index_gen = rand_index_generator(PRIME, SIZE)
    # random traversal using iterator above, write counter to file
    for counter in xrange(1, SIZE - 16):
        f.seek(index_gen.next() * 4)
        f.write(pack('>I', counter))
Then to validate, I traverse in the same order and watch for gaps of unwritten data. This is after reverting the VM back to the snapshot. I know all the traversal and writing code works, since validation runs smoothly with no missed writes before reverting, but I think some "written" data dies in RAM and never makes it to disk.
I will take any suggestions to guarantee the write order I need for this application.
Found out the answer to this question. I misunderstood the effect of writing to a raw disk: it did not eliminate OS caching, since I was still calling the OS to write to my raw disk. Oops.
To bypass OS caches you should use os.open and pass the os.O_DIRECT and os.O_SYNC flags to make sure writes happen in the correct sequence and are not stuck in volatile memory. I used mmap and os file descriptors, but you could also use normal file handles (a sketch of that follows the write loop below).
Page size is specific to your operating system; for Linux it is typically 4096.
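If you would rather not hard-code that number, a small sketch of querying it at run time instead:
import mmap
PAGESIZE = mmap.PAGESIZE   # the system page size, e.g. 4096 on typical Linux x86-64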
The top section of the code stayed the same but here is the write loop:
import os    # os and mmap are needed in addition to the earlier imports
import mmap

PAGESIZE = 4096
filedesc = os.open('/dev/sdb', os.O_DIRECT | os.O_SYNC | os.O_RDWR)
for counter in xrange(1, SIZE - 16):
    write_loc = index_gen.next() * 4
    page_dist = write_loc % PAGESIZE
    offset = write_loc - page_dist
    # map the page containing the target offset and patch the 4 counter bytes
    bytemap = mmap.mmap(filedesc, PAGESIZE, offset=offset)
    bytemap[page_dist:(page_dist + 4)] = pack('>I', counter)
    bytemap.flush()
    bytemap.close()
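For reference, here is a hedged sketch of the "normal file handles" route mentioned above; it is my reading of that remark, not the answerer's original code, and it keeps the question's Python 2.7 idioms. Dropping O_DIRECT avoids its alignment requirements, while O_SYNC alone still forces every write to reach the device before the call returns, which should be enough to preserve ordering here. SIZE and PRIME are illustrative values, not the question's real ones.
import os
from struct import pack

SIZE, PRIME = 1024, 367   # illustrative; the real script derives these from
                          # the disk size and a random seed

def rand_index_generator(a, b):
    ctr = 0
    while True:
        yield ctr % b
        ctr += a

# O_SYNC writes still pass through the page cache, but each write() call only
# returns once the data has been transferred to the device, so program order
# and on-disk order match without O_DIRECT's alignment rules.
fd = os.open('/dev/sdb', os.O_SYNC | os.O_RDWR)
f = os.fdopen(fd, 'r+b', 0)          # wrap the descriptor in an unbuffered file object
index_gen = rand_index_generator(PRIME, SIZE)
for counter in xrange(1, SIZE - 16):
    f.seek(index_gen.next() * 4)
    f.write(pack('>I', counter))
f.close()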

Python tarfile size

I can calculate the size of the files in a tarfile in this way:
import tarfile
tf = tarfile.open(name='my.tgz', mode='r')
reduce(lambda x,y: getattr(x, 'size', x)+getattr(y,'size',y), tf.getmembers())
but the total size returned is the sum of the sizes of the members in the tarfile, not the compressed file size (at least, that is what I believe from trying this).
Is there a way to get the compressed size of the whole tar file without checking it through something like the os.path.getsize?
No.
The way tar.gz works is that the file is piped through gzip to get a plain tar archive. tar(1) has no idea that the archive was compressed in the first place, so it can't know about compressed sizes[*].
This is unlike archive formats like ZIP which compress by themselves.
The advantage of the tar approach is that you can use any compression that you like. If some better compressor comes along, you can easily repack your archives. Also, since everything is put into one big stream of data, compression ratio is slightly better and meta data like file names is also compressed.
The disadvantage is that you can no longer seek within the archive; to unpack an individual item you have to read (and decompress) the stream up to it.
[*]: The first implementations of tar(1) had no -z option; it was added later when people started to use gzip a lot. In the early days, the standard compression was using compress to get tar.Z.
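To make the contrast concrete, here is a minimal sketch (assuming an archive named my.tgz): the uncompressed total comes from the member headers, while the compressed size can only come from the file system.
import os
import tarfile

# Uncompressed total: sum the sizes recorded in the tar member headers.
with tarfile.open('my.tgz', 'r') as tf:
    uncompressed = sum(member.size for member in tf.getmembers())

# Compressed size: tarfile has no API for this, so ask the file system.
compressed = os.path.getsize('my.tgz')

print(uncompressed, compressed)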

How to determine file's extended attributes and resource forks with their size on Mac OSX?

I had written a small utility for creating XML for any folder structure and comparing folders via the generated XML, supporting both Windows and Mac as platforms. However, on Mac, recursively calculating the folder size doesn't add up to the total size. On investigation, it turned out that this is due to extended attributes and resource forks present on certain files.
Does anybody know how I can determine these extended attributes and resource forks, and their sizes, preferably in Python? Currently, I am using os.path.getsize to determine the size of a file and adding the file sizes to determine the folder size.
You want the hidden member of a stat result called st_blocks.
>>> s = os.stat('some_file')
>>> s
posix.stat_result(st_mode=33261, st_ino=12583347, st_dev=234881026,
st_nlink=1, st_uid=1000, st_gid=20, st_size=9889973,
st_atime=1301371810, st_mtime=847731600, st_ctime=1301371422)
>>> s.st_size / 1e6 # size of data fork only, in MB
9.889973
>>> s.st_blocks * 512e-6 # total size on disk, in MB
20.758528
The file in question has about 10 MB in the resource fork, which shows up in the result from stat but in a "hidden" attribute. (Bonus points for anyone who knows exactly which file this is.) Note that it is documented in man 2 stat that the st_blocks attribute always measures increments of 512 bytes.
Note: st_size measures the number of bytes of data, but st_blocks measures size on disk including the overhead from partially used blocks. So,
>>> open('file.txt', 'w').write('Hello, world!')
13
>>> s = os.stat('file.txt')
>>> s.st_size
13
>>> s.st_blocks * 512
4096
Now if you do a "Get Info" in the Finder, you'll see that the file has:
Size: 4 KB on disk (13 bytes)
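Building on that, here is a hedged sketch of how the folder-size pass from the question could use st_blocks instead of os.path.getsize. folder_size_on_disk is an illustrative name, and the approach relies on st_blocks covering the resource fork as described above.
import os

def folder_size_on_disk(root):
    # st_blocks is documented as 512-byte units, so this total includes
    # resource forks and the overhead of partially used blocks.
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files that vanished or are unreadable
            total += st.st_blocks * 512
    return total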
Merely a partial answer ... but to learn the size of resource forks you can simply use the namedfork pseudo-directory:
os.path.getsize("<path to file of interest>/..namedfork/rsrc")
It's theoretically possible that other named forks may exist ... but you can't discover a list of available forks.
As to the extended attributes ... what "size" are you interested in? You can use the xattr module to discover their content and thus the length of the key/value pairs.
But if you are interested more in their "on disk" size ... then it's worth noting that extended attributes are not stored in some sort of file. They form part of the file metadata (i.e. just like the name and modified time are metadata) and are stored directly within a B*-tree node rather than in some "file".
Two options:
You could try using subprocess to call the system's "ls" or "du" command, which should be aware of the extended attributes.
or
You could install the xattr package, which can read the resource fork in addition to extended attributes (it's accessed via xattr.XATTR_RESOURCEFORK_NAME). Something like this might work:
import xattr

x = xattr.xattr("/path/to/my/file")
size_ = 0
for attribute in x:
    size_ += len(x[attribute])
print size_
You might need to play around a little with the format of the extended attributes, as they're returned as strings but might be binary (?).
If you provide a minimal almost working example of code, I might be able to play with it a little more.

Fast Reading of 10000 Binary Files?

I have 10,000 binary files, named like this:
file0.bin
file1.bin
............
............
file10000.bin
Each of the above files contains exactly 391 float values (1564 bytes per file).
My goal is to read all of the files into a Python array in the fastest way possible. If I open and close each file using a script, it takes a lot of time (about 8 min!).
Are there any other creative ways to read these files FAST?
I am using Ubuntu Linux and would prefer a solution that can work with Python. Thanks.
If you want it even faster, make a ramdisk:
# mkfs -q /dev/ram1 $(( 2 * 10000)) ## roughly the size you need
# mkdir -p /ramcache
# mount /dev/ram1 /ramcache
# df -H | grep ramcache
Now concatenate the files:
# cat file{0..10000}.bin >> /ramcache/concat.bin ## thanks SiegeX
Then point your script at that file.
Since I haven't tested I prefixed everything with '#' so that you wouldn't have any accidents. Just remove them if you want it to work.
This is an option but I would urge you to consider looking at the comments people have posted directly under your Q
You could probably get better results by examining what you are doing wrong, as I could not reproduce your speed problem of 8 minutes.
Iterate over them and use the optimise flag; you might also want to run the code under PyPy, which compiles Python via a JIT compiler, allowing for a somewhat marked increase in speed.
You have 10001 files (0 to 10000 inclusive) and it takes 8 minutes to run the following?
try:
    xrange  # Python 2/3 compatibility
except NameError:
    xrange = range

import array

final = array.array('f')
for file_seq in xrange(10001):
    with open("file%d.bin" % file_seq, "rb") as fp:
        final.fromfile(fp, 391)
What's the underlying filesystem? How much RAM do you have? What's your processor and its speed?
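For completeness, here is a hedged sketch of another common route, assuming NumPy is available and the files hold native little-endian 32-bit floats; the shape constants mirror the question.
import numpy as np

NUM_FILES = 10001        # file0.bin .. file10000.bin
FLOATS_PER_FILE = 391    # 1564 bytes / 4 bytes per float

data = np.empty((NUM_FILES, FLOATS_PER_FILE), dtype=np.float32)
for i in range(NUM_FILES):
    # reads the raw float32 values straight into the preallocated row
    data[i] = np.fromfile("file%d.bin" % i, dtype=np.float32,
                          count=FLOATS_PER_FILE)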

python zipfile module doesn't seem to be compressing my files

I made a little helper function:
import zipfile

def main(archive_list=[], zfilename='default.zip'):
    print zfilename
    zout = zipfile.ZipFile(zfilename, "w")
    for fname in archive_list:
        print "writing: ", fname
        zout.write(fname)
    zout.close()

if __name__ == '__main__':
    main()
The problem is that all my files are NOT being COMPRESSED! The files are the same size and, effectively, just the extension is being changed to ".zip" (from ".xls" in this case).
I'm running python 2.5 on winXP sp2.
This is because ZipFile requires you to specify the compression method. If you don't specify it, it assumes the compression method to be zipfile.ZIP_STORED, which only stores the files without compressing them. You need to specify the method to be zipfile.ZIP_DEFLATED. You will need to have the zlib module installed for this (it is usually installed by default).
import zipfile

def main(archive_list=[], zfilename='default.zip'):
    print zfilename
    zout = zipfile.ZipFile(zfilename, "w", zipfile.ZIP_DEFLATED)  # <--- this is the change you need to make
    for fname in archive_list:
        print "writing: ", fname
        zout.write(fname)
    zout.close()

if __name__ == '__main__':
    main()
Update: As per the documentation (Python 3.7), the value of the compression argument should be specified to override the default, which is ZIP_STORED. The available options are ZIP_DEFLATED, ZIP_BZIP2 and ZIP_LZMA, and the corresponding libraries zlib, bz2 and lzma must be available.
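For reference, here is a minimal Python 3 sketch of the same helper using a with statement and the compresslevel argument added in 3.7; the file names are illustrative.
import zipfile

archive_list = ['report.xls', 'notes.txt']  # illustrative file names

with zipfile.ZipFile('default.zip', 'w',
                     compression=zipfile.ZIP_DEFLATED,
                     compresslevel=9) as zout:
    for fname in archive_list:
        print("writing:", fname)
        zout.write(fname)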
There is a really easy way to make a compressed zip: use shutil.make_archive.
For example:
import shutil
shutil.make_archive(file_name, 'zip', directory_to_compress)
Here file_name is the output path without the .zip extension, and directory_to_compress is the root directory whose contents get archived.
See the shutil documentation for more details.
Hope this is going to be useful to someone.
I tested all zip modes and benchmarked them on two data sets, a small one (~30 MB) and a large one (~1.5 GB). They consisted of various types of files so it would be as close to a real-life scenario as possible. I ran two kinds of test on each data set: a "proportional" one and a "complete" one. Both tests were repeated 3 times one after another to get an average. Those results may differ depending on your machine, but I think it's still a good place to start.
I did the test in two methods because I’m trying to make my own specialized backup solution.
The proportional method creates more zip files, but it allows me to transfer smaller packages of data if necessary, e.g. replacing only things that changed. It's more complicated than that, but it is not important right now.
The complete method is just straight up compressing whole folder.
Compression ratio calculation:
size_difference = source_size - compressed_size
compression_ratio = (size_difference * 100.0) / source_size
Basically the higher that number the better.
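As a tiny illustration of the formula (the numbers are made up):
def compression_ratio(source_size, compressed_size):
    size_difference = source_size - compressed_size
    return (size_difference * 100.0) / source_size

print(compression_ratio(1000, 250))  # 75.0, i.e. three quarters of the size was saved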
Each zip archive was initialized like this:
# Mode tests
with zipfile.ZipFile(target_zip, 'w', compression_method) as ziph:
# Level tests
with zipfile.ZipFile(target_zip, 'w', compression_method, compresslevel=level) as ziph:
The results, in summary:
It seems that no matter the method, the most optimal compression mode is ZIP_DEFLATED.
The only mode that gave me a smaller archive was ZIP_LZMA, but it was smaller by only a fraction of a percent and took about 8x longer for large data sets.
Furthermore, I tried different compression levels with the same data sets and methods, except this time there was only one run per level.
It looks like ZIP_DEFLATED and ZIP_BZIP2 have similar compression capabilities, but the second one is much slower. For large data sets a compression level of 1 or 2 should suffice; increasing it further has no significant effect on the final file size. If the workload demands a lot of "small" zip files it is better to use level 9: it gives a high compression ratio but takes about the same amount of time as level 1.
