saving a long string using cPickle - saved variable is truncated - python

I am dumping a string of 0s and 1s of length 4807100171 into a pickle file because I had previous trouble with bitarray and wanted to see if pickle could be a solution to my problem. However, after I load it, it is only of length 512132875.
Why is that?
I have searched to see if there are any limitations in pickle, but I haven't found anything... If there is a well-known reason, I might not be using the correct keywords...
Edit:
You can fill a string b with values until you get a length of 4807100171, using whatever technique you prefer - perhaps something like a simple for loop running to 4807100171. I personally encode the original data using Huffman coding, but that would be a long example that I feel is not really necessary here.
I then dump the string b as follows:
b = ""
for i in xrange(4807100171):  # xrange, since range() would build a huge list in Python 2
    b += "0"
import cPickle as pickle
pickle.dump(b, open("string.p", "wb"), pickle.HIGHEST_PROTOCOL)

This is obviously an integer overflow problem - notice that 4807100171 minus 2**32 is 512132875. Unfortunately, a 32-bit integer is how the binary pickle format represents string lengths. It appears that using the text pickle format (protocol version 0) would avoid this problem, but text pickles are generally longer, and would take an absurd amount of memory to handle a string of this size. I haven't actually tested this - I don't think I have enough memory on any of my computers to do so!
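A quick check in the interpreter confirms that arithmetic:

>>> 2**32
4294967296
>>> 4807100171 - 2**32
512132875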
If this one string is the only thing being stored, then it would be far simpler to just write the string itself to a file.
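For example, a minimal sketch of that approach (the file name is just an example):

# Write the bit string straight to disk instead of pickling it.
with open("bits.txt", "w") as f:
    f.write(b)

# Read it back later; len(b) is the full 4807100171 again.
with open("bits.txt") as f:
    b = f.read()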

Related

How do I compress a rather long binary string in Python so that I will be able to access it later?

I have a long array of items (4700) that will ultimately be 1 or 0 when compared to settings in another list. I want to be able to construct a single integer/string item that I can store in some of the metadata such that it can be accessed later in order to uniquely identify the combination of items that goes into it.
I am writing this all in Python. I am thinking of doing something like zlib compression plus a hex conversion, but I am getting myself confused as to how to do the inverse transformation. So, assuming bin_string is the string of 1's and 0's, it should look something like this:
import zlib
#example bin_string, real one is much longer
bin_string="1001010010100101010010100101010010101010000010100101010"
compressed = zlib.compress(bin_string.encode())
this_hex = compressed.hex()
where I can then save this_hex to the metadata. The question is, how do I get the original bin_string back from my hex value? I have lots of Python experience with numerical methods and such but little with compression, so any basic insights would be very valuable.
Just do the inverse of each operation. This:
zlib.decompress(bytearray.fromhex(this_hex)).decode()
will return your original string.
It would be faster and might even result in better compression to simply encode your bits as bits in a byte string, along with a terminating one bit followed by zeros to pad out the last byte. That would be seven bytes instead of the 22 you're getting from zlib.compress(). zlib would do better only if there is a strong bias for 0's or 1's, and/or there are repeating patterns in the 0's and 1's.
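A rough sketch of that bit-packing scheme (Python 3; the helper names are made up for illustration):

def pack_bits(bits):
    # Append the terminating '1' bit, then pad with '0's to a byte boundary.
    padded = bits + "1"
    padded += "0" * (-len(padded) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

def unpack_bits(data):
    bits = "".join(format(byte, "08b") for byte in data)
    # Drop the padding: everything from the last '1' bit onwards.
    return bits[:bits.rindex("1")]

bin_string = "1001010010100101010010100101010010101010000010100101010"
packed = pack_bits(bin_string)          # 7 bytes for this 55-bit example
assert unpack_bits(packed) == bin_string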
As for encoding for the metadata, Base64 would be more compact than hexadecimal. Your example would be lKVKVKoKVQ==.
You should try using numpy's savez_compressed() function.
Convert your simple array into a numpy array (say my_array) and then use
numpy.savez_compressed("filename.npz", my_array)
Use
numpy.load("filename.npz")
to load the .npz file back.
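For example, a minimal round trip along those lines (the array and file names are just placeholders):

import numpy as np

bits = np.array([1, 0, 0, 1, 0, 1], dtype=np.uint8)  # stand-in for your 4700-element 0/1 array
np.savez_compressed("bits.npz", bits=bits)           # compressed save

loaded = np.load("bits.npz")["bits"]                 # load it back by keyword name
assert (loaded == bits).all()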

Python Multiprocessing & ctype arrays

I'm trying to do some work on a file; the file has various data in it, and I'm pulling it in as raw strings and then working on those strings.
I'm trying to make the process multithreaded, so I can work on several chunks at once, but of course the files are quite large, several gigabytes, so memory is an issue.
The processes don't need to modify the input data, so they don't need their own copies. However, I don't know how to make an array of strings as a ctype in Python 2.7.
Currently I have:
import multiprocessing, ctypes
from multiprocessing.sharedctypes import Value, Array
with open('test.txt', 'r') as fin:
    rawdata = Array('c', fin.readlines(), lock=False)
But this doesn't work as I'd hoped; it sees the whole thing as one massive char buffer array and fails because it wants a single string object. I need to be able to pull out the original lines and work with them using existing Python code that examines the contents of the lines and does some operations, which vary from substring matching to pulling out integer and float values from the strings for mathematical operations. Is there any sensible way I can achieve this that I'm missing? Perhaps I'm using the wrong item (Array) to push the data to a shared c format?
Do you want your strings to end up as Python strings, or as c-style strings a.k.a. null-terminated character arrays? If you're working with python string processing, then simply reading the file into a non-ctypes python string and using that everywhere is the way to go -- python doesn't copy strings by default, since they're immutable anyway. If you want to use c-style strings, then you will want to allocate a character buffer using ctypes, and use fin.readinto(buffer).
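If you do want the c-style route, here is a minimal sketch (Python 2.7; the file name is just an example, and it assumes readinto() can fill the ctypes buffer directly; child processes created afterwards inherit the buffer via fork):

import ctypes
import os
from multiprocessing.sharedctypes import RawArray

path = 'test.txt'                     # example input file
size = os.path.getsize(path)

# One shared, lock-free character buffer sized to the whole file.
rawdata = RawArray(ctypes.c_char, size)

with open(path, 'rb') as fin:
    fin.readinto(rawdata)             # fill the shared buffer in place

# Worker processes started after this point can slice it like a string:
# line = rawdata[start:end]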

fixed length flat file without separators

I have a conceptual question not related to any particular language, but since I am writing in Python, I tagged python. I am asked to provide some data in a "fixed-length flat file without separators". It confuses me, since I understand it like this:
Input: Column A: date (len8)
Input: Column B: name (len20)
Output: "20170409MYVERYSHORTNAME[space][space][space][space][space]"
"MYVERYSHORTNAME" is only 15 char long, but since it's fixed 20-length output, I am supposed to fill 5 times it with something ? It's not specified.
Why do someone even needs a file without separators? He/she will need to break it down to separated fields anyway, what's the point?
This kind of flat (binary) file is meant to be faster and easier for machines to read, and more memory efficient than the equivalent in a more human-friendly representation (e.g. JSON, CSV, etc.). For example, the machine can preallocate the appropriate amount of memory before reading the contents.
Nowadays, with virtually unlimited RAM and the dynamic nature of modern languages, hardly anyone uses flat files anymore (unless specifically needed).
In Python, in order to deal properly with this kind of binary files, you can for example use the struct module from the standard library:
https://docs.python.org/3.6/library/struct.html#module-struct
Example:
import struct
from datetime import datetime
mydate = datetime.now()
myshortname = "HelloWorld!"
struct.pack("8s20s", mydate.strftime('%Y%m%d').encode(), myshortname.encode())
>>> b'20170919HelloWorld!\x00\x00\x00\x00\x00\x00\x00\x00\x00'
(the exact date part depends on when you run it)
Typically, when you see fixed-length files, you're dealing with legacy systems. The AS400, for instance, usually spits out fixed-length files with artificial separators (why, I don't know, but that's what I've seen).
Usually, strings are right-padded with spaces, and numbers are left-padded with 0's (zeros).
This is not absolute.
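If the fields are plain text, simple string padding and slicing also work; here is a minimal sketch of that convention (the record layout is borrowed from the question above):

# Hypothetical layout: date (8 chars) + name (20 chars), no separators.
def make_record(date, name):
    return date.ljust(8) + name.ljust(20)            # strings right-padded with spaces

def parse_record(line):
    return line[0:8].rstrip(), line[8:28].rstrip()   # fields recovered purely by position

record = make_record("20170409", "MYVERYSHORTNAME")
print(repr(record))           # '20170409MYVERYSHORTNAME     '
print(parse_record(record))   # ('20170409', 'MYVERYSHORTNAME')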

change specific indexes in string to same value python

Goal
Read in a massive binary file (approx. 1.3 GB), change certain bits, and then write it back out to a separate file (I cannot modify the original file).
Method
When I read in the binary file, it gets stored as a massive string encoded in hex format, which is immutable since I am using Python.
My algorithm loops through the entire file and stores in a list all the indexes of the string that need to be modified. The catch is that all of those indexes need to be set to the same value. I cannot do this in place due to the immutability of strings. I cannot convert the string into a list of chars because that blows my memory constraints and takes a huge amount of time. The viable thing to do is to build the result in a separate string, but because of immutability I have to create a ton of string objects and keep concatenating to them.
I used some ideas from https://waymoot.org/home/python_string/ but they don't give me good performance. Any ideas? The goal is to copy an existing, super long string exactly into another, except for certain placeholders determined by the values in the index list.
So, to be honest, you shouldn't be reading your file into a string, and you especially shouldn't be writing out anything but the bytes you actually change.
That is just a waste of resources, since you only seem to be reading linearly through the file, noting down the places that need to be modified.
On all OSes with some level of mmap support (that is, Unixes, among them Linux, OS X and *BSD, as well as other OSes like Windows), you can use Python's mmap module to just open the file in read/write mode, scan through it and edit it in place, without the need to ever load it to RAM completely and then write it back out. Simple example, replacing all 12-valued bytes with something position-dependent:
import mmap

with open("infilename", "rb") as in_f:
    in_view = mmap.mmap(in_f.fileno(), 0, access=mmap.ACCESS_READ)  # length = 0: map the complete file
    length = in_view.size()
    with open("outfilename", "w+b") as out_f:
        out_f.truncate(length)                    # size the output file before mapping it
        out_view = mmap.mmap(out_f.fileno(), length)
        for i in range(length):
            if in_view[i] == 12:                  # Python 3: indexing an mmap yields ints
                out_view[i] = in_view[i] + i % 10
            else:
                out_view[i] = in_view[i]
What about slicing the string, modifying each slice, and writing it back to disk before moving on to the next slice? Or would that be too hard on the disk?

Writing and reading headers with struct

I have a file header which I am reading and planning on writing, which contains information about the contents: version information and other string values.
Writing to the file is not too difficult, it seems pretty straightforward:
outfile.write(struct.pack('<s', "myapp-0.0.1"))
However, when I try reading back the header from the file in another method:
header_version = struct.unpack('<s', infile.read(struct.calcsize('s')))
I have the following error thrown:
struct.error: unpack requires a string argument of length 2
How do I fix this error and what exactly is failing?
Writing to the file is not too difficult, it seems pretty straightforward:
Not quite as straightforward as you think. Try looking at what's in the file, or just printing out what you're writing:
>>> struct.pack('<s', 'myapp-0.0.1')
'm'
As the docs explain:
For the 's' format character, the count is interpreted as the size of the string, not a repeat count like for the other format characters; for example, '10s' means a single 10-byte string, while '10c' means 10 characters. If a count is not given, it defaults to 1.
So, how do you deal with this?
Don't use struct if it's not what you want. The main reason to use struct is to interact with C code that dumps C struct objects directly to/from a buffer/file/socket/whatever, or a binary format spec written in a similar style (e.g. IP headers). It's not meant for general serialization of Python data. As Jon Clements points out in a comment, if all you want to store is a string, just write the string as-is. If you want to store something more complex, consider the json module; if you want something even more flexible and powerful, use pickle.
Use fixed-length strings. If part of your file format spec is that the name must always be 255 characters or less, just write '<255s'. Shorter strings will be padded, longer strings will be truncated (you might want to throw in a check for that to raise an exception instead of silently truncating).
Use some in-band or out-of-band means of passing along the length. The most common is a length prefix. (You may be able to use the 'p' or 'P' formats to help, but it really depends on the C layout/binary format you're trying to match; often you have to do something ugly like struct.pack('<h{}s'.format(len(name)), len(name), name).)
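For example, a minimal sketch of that length-prefix round trip (Python 2 style, matching the rest of this answer; the helper and file names are just illustrative):

import struct

def write_string(f, s):
    f.write(struct.pack('<h{}s'.format(len(s)), len(s), s))     # 2-byte length prefix, then the bytes

def read_string(f):
    (length,) = struct.unpack('<h', f.read(2))                  # read the prefix first
    return struct.unpack('{}s'.format(length), f.read(length))[0]

with open('header.bin', 'wb') as f:
    write_string(f, 'myapp-0.0.1')

with open('header.bin', 'rb') as f:
    header_version = read_string(f)    # 'myapp-0.0.1'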
As for why your code is failing, there are multiple reasons. First, read(11) isn't guaranteed to read 11 characters. If there's only 1 character in the file, that's all you'll get. Second, you're not actually calling read(11), you're calling read(1), because struct.calcsize('s') returns 1 (for reasons which should be obvious from the above). Third, either your code isn't exactly what you've shown above, or infile's file pointer isn't at the right place, because that code as written will successfully read in the string 'm' and unpack it as 'm'. (I'm assuming Python 2.x here; 3.x will have more problems, but you wouldn't have even gotten that far.)
For your specific use case ("file header… which contains information about the contents; version information, and other string values"), I'd just write the strings with newline terminators. (If the strings can have embedded newlines, you could backslash-escape them into \n, use C-style or RFC822-style continuations, quote them, etc.)
This has a number of advantages. For one thing, it makes the format trivially human-readable (and human-editable/-debuggable). And, while sometimes that comes with a space tradeoff, a single-character terminator is at least as efficient, possibly more so, than a length-prefix format would be. And, last but certainly not least, it means the code is dead-simple for both generating and parsing headers.
In a later comment you clarify that you also want to write ints, but that doesn't change anything. An 'i' int value will take 4 bytes, but most apps write a lot of small numbers, which only take 1-2 bytes (+1 for a terminator/separator) if you write them as strings. And if you're not writing small numbers, a Python int can easily be too large to fit in a C int - in which case struct will silently overflow and just write the low 32 bits.
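For instance, a minimal sketch of that newline-terminated header (the field layout is just an example):

# Writing: one field per line; numbers are written as decimal strings.
with open('header.txt', 'w') as f:
    f.write('myapp-0.0.1\n')
    f.write('42\n')                    # e.g. some small integer field

# Reading: one readline() per field, converting where needed.
with open('header.txt') as f:
    version = f.readline().rstrip('\n')
    count = int(f.readline())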
