I have a really long string that I want to shorten into a smaller string of random-looking characters, similar to what a hash does. However, I want to be able to undo it later to read it. As far as I know, hashes cannot be reversed, so I could not read the string later.
You can use Python's built-in compression library, zlib:
>>> from zlib import compress, decompress
>>> original = 'A' * 1024
>>> len(original)
1024
>>> compressed = compress(original.encode('utf-8'))
>>> len(compressed)
17
>>> original == decompress(compressed).decode('utf-8')
True
Note that the original string must contain some patterns to be compressed efficiently. In general, the more entropy original has, the longer compressed will be.
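A quick way to see that caveat in action (a small sketch; os.urandom stands in as hypothetical high-entropy data):
import os
from zlib import compress
repetitive = b'A' * 1024
random_ish = os.urandom(1024)       # incompressible stand-in
print(len(compress(repetitive)))    # tiny: 17 bytes, as above
print(len(compress(random_ish)))    # slightly *longer* than 1024; there is nothing to compress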
I ended up using a database and just stuck with the long strings. Originally I was going to shorten them and skip the database, but I think a database is better anyway, and it lets me store the long form.
I have a long array of items (4,700 of them) that will ultimately each be 1 or 0 when compared against settings in another list. I want to construct a single integer/string from it that I can store in metadata, so it can be retrieved later to uniquely identify the combination of items that went into it.
I am writing this all in Python. I am thinking of something like zlib compression plus a hex conversion, but I am confusing myself about how to do the inverse transformation. So, assuming bin_string is the string of 1's and 0's, it should look something like this:
import zlib
# example bin_string; the real one is much longer
bin_string = "1001010010100101010010100101010010101010000010100101010"
compressed = zlib.compress(bin_string.encode())  # str -> bytes -> compressed bytes
this_hex = compressed.hex()  # hex text that can be stored as metadata
where I can then save this_hex to the metadata. The question is: how do I get the original bin_string back from the hex value? I have lots of Python experience with numerical methods and such, but little with compression, so any basic insights would be very valuable.
Just do the inverse of each operation. This:
zlib.decompress(bytearray.fromhex(this_hex)).decode()
will return your original string.
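Put together, the round trip looks like this (a minimal sketch; each step on the way back mirrors one step on the way out, in reverse order):
import zlib
bin_string = "1001010010100101010010100101010010101010000010100101010"
this_hex = zlib.compress(bin_string.encode()).hex()  # forward: encode -> compress -> hex
# backward: unhex -> decompress -> decode
restored = zlib.decompress(bytearray.fromhex(this_hex)).decode()
assert restored == bin_string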
It would be faster and might even result in better compression to simply encode your bits as bits in a byte string, along with a terminating one bit followed by zeros to pad out the last byte. That would be seven bytes instead of the 22 you're getting from zlib.compress(). zlib would do better only if there is a strong bias for 0's or 1's, and/or there are repeating patterns in the 0's and 1's.
As for encoding for the metadata, Base64 would be more compact than hexadecimal. Your example would be lKVKVKoKVQ==.
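A minimal sketch of that scheme (pack_bits and unpack_bits are hypothetical names, not a standard API): append the terminating 1 bit, pad the last byte out with 0s, and convert through an integer.
import base64

def pack_bits(bin_string):
    bits = bin_string + '1'         # terminating one bit marks where the padding begins
    bits += '0' * (-len(bits) % 8)  # pad out the last byte with zeros
    return int(bits, 2).to_bytes(len(bits) // 8, 'big')

def unpack_bits(data):
    bits = bin(int.from_bytes(data, 'big'))[2:].zfill(len(data) * 8)
    return bits[:bits.rindex('1')]  # drop everything from the last one bit onward

bin_string = "1001010010100101010010100101010010101010000010100101010"
packed = pack_bits(bin_string)
print(len(packed))                         # 7 bytes for this 55-bit example
print(base64.b64encode(packed).decode())   # lKVKVKoKVQ==
assert unpack_bits(packed) == bin_string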
You could try numpy's savez_compressed() function.
Convert your simple array into a numpy array and then use this (arrays are saved under keyword names; bits here is arbitrary):
numpy.savez_compressed("filename.npz", bits=bit_array)
Use
numpy.load("filename.npz")["bits"]
to get the array back out of the .npz file.
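A minimal round trip might look like this (the filename and the bits keyword are arbitrary):
import numpy as np
bin_string = "1001010010100101010010100101010010101010000010100101010"
bits = np.array([int(c) for c in bin_string], dtype=np.uint8)
np.savez_compressed("bits.npz", bits=bits)  # arrays are saved under keyword names
restored = np.load("bits.npz")["bits"]      # ...and loaded back by the same name
assert "".join(map(str, restored)) == bin_string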
I'm trying to do some work on a file. The file has various data in it, and I'm pulling it in as raw strings and then working on those strings.
I'm trying to make the process multithreaded so I can work on several chunks at once, but of course the files are quite large, several gigabytes, so memory is an issue.
The processes don't need to modify the input data, so they don't need their own copies. However, I don't know how to make an array of strings as a ctype in Python 2.7.
Currently I have:
import multiprocessing, ctypes
from multiprocessing.sharedctypes import Value, Array
with open('test.txt', 'r') as fin:
    rawdata = Array('c', fin.readlines(), lock=False)
But this doesn't work as I'd hoped: it treats the whole thing as one massive char buffer array and fails because it wants a single string object. I need to be able to pull out the original lines and work on them with existing Python code that examines their contents and does some operations, which vary from substring matching to pulling integer and float values out of the strings for mathematical operations. Is there any sensible way I can achieve this that I'm missing? Perhaps I'm using the wrong type (Array) to push the data into a shared C format?
Do you want your strings to end up as Python strings, or as c-style strings, a.k.a. null-terminated character arrays? If you're working with Python string processing, then simply reading the file into a non-ctypes Python string and using that everywhere is the way to go; Python doesn't copy strings by default, since they're immutable anyway. If you want to use c-style strings, then you will want to allocate a character buffer using ctypes and use fin.readinto(buffer).
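A minimal sketch of the c-style route, sized from the file itself (test.txt as in the question):
import ctypes
import os
size = os.path.getsize('test.txt')
buf = ctypes.create_string_buffer(size)  # writable character buffer of exactly that size
with open('test.txt', 'rb') as fin:
    fin.readinto(buf)                    # fills the buffer without an intermediate copy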
Goal
Read in a massive binary file of approximately 1.3 GB, change certain bits, and then write it back out to a separate file (I cannot modify the original file).
Method
When I read in the binary file, it gets stored as one massive hex-encoded string, which is immutable since I am using Python.
My algorithm loops through the entire file and stores in a list all the indexes of the string that need to be modified. The catch is that all of those indexes need to be changed to the same value. I cannot do this in place due to immutability, and I cannot convert the string into a list of chars, because that blows past my memory constraints and takes a huge amount of time. The viable thing seems to be to build the result in a separate string, but due to immutability I end up creating a ton of string objects and concatenating onto them.
I used some ideas from https://waymoot.org/home/python_string/, but it doesn't give me good performance. Any ideas? The goal is to copy an existing, super-long string exactly into another, except at certain placeholder positions determined by the values in the index list.
So, to be honest, you shouldn't be reading your file into a string, and you especially shouldn't be writing out anything but the bytes you actually change.
That is just a waste of resources, since you only seem to be reading linearly through the file, noting down the places that need to be modified.
On all OSes with some level of mmap support (that is, Unixes such as Linux, OS X and the *BSDs, as well as other OSes like Windows), you can use Python's mmap module to map the input file, scan through it and write the edited bytes into a mapped output file, without the need to ever load either file into RAM completely. Stupid example, changing all 12-valued bytes in a position-dependent way:
import mmap

with open("infilename", "rb") as in_f:
    in_view = mmap.mmap(in_f.fileno(), 0, access=mmap.ACCESS_READ)  # length 0: map the complete file
    length = in_view.size()
    with open("outfilename", "wb+") as out_f:
        out_f.truncate(length)  # the output file must have the right size before it can be mapped
        out_view = mmap.mmap(out_f.fileno(), length)
        for i in range(length):
            if in_view[i] == 12:
                out_view[i] = in_view[i] + i % 10
            else:
                out_view[i] = in_view[i]
What about slicing the string, modifying each slice, and writing it back to disk before moving on to the next slice? Or is that too intensive for the disk?
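A rough sketch of that chunked idea (the chunk size and the byte test are placeholders; it keeps only one slice in memory at a time):
CHUNK = 64 * 1024 * 1024  # 64 MiB per slice; tune to your memory budget

with open("infilename", "rb") as fin, open("outfilename", "wb") as fout:
    offset = 0
    while True:
        chunk = bytearray(fin.read(CHUNK))  # mutable copy of one slice
        if not chunk:
            break
        for i, byte in enumerate(chunk):
            if byte == 12:                  # same placeholder test as above
                chunk[i] = byte + (offset + i) % 10
        fout.write(chunk)
        offset += len(chunk)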
I want to send a list through UDP/TCP, but sockets transmit only strings/bytes, so I need to convert the list into a string and then convert it back.
My list is like
['S1','S2','H1','C1','D8']
I know I can use
string_ = ''.join(list_)
to convert it into string.
But how to convert it back?
Or there is another way I can use UDP/TCP to send a list?
A custom format would depend on assumptions about the format of the list items, so json looks like the safest way to go:
>>> import json
>>> data = json.dumps(['S1','S2','H1','C1','D8'])
>>> data
'["S1", "S2", "H1", "C1", "D8"]'
>>> json.loads(data)
[u'S1', u'S2', u'H1', u'C1', u'D8']
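For completeness, here is a sketch of the whole trip over UDP (the localhost address and port are made up; on a real network you would also have to mind datagram size limits):
import json
import socket

ADDR = ('127.0.0.1', 9999)  # hypothetical endpoint

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(ADDR)  # the receiver must be bound before the datagram is sent

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(json.dumps(['S1', 'S2', 'H1', 'C1', 'D8']).encode('utf-8'), ADDR)

payload, _ = receiver.recvfrom(4096)
print(json.loads(payload.decode('utf-8')))  # ['S1', 'S2', 'H1', 'C1', 'D8']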
Use a separator:
string_ = ';'.join(list_)
list_ = string_.split(';')
You need to make sure the separator character can't occur within your strings. If it can, you might need escaping or encoding.
If you have Python on both ends of the network communication, you can use the dumps and loads functions from the pickle module, designed especially for serializing Python objects:
import pickle
a = ['S1','S2','H1','C1','D8']
string_ = pickle.dumps(a)
...
a = pickle.loads(string_)
Otherwise, the solution proposed by @bereal is the better one, because there are json libraries for most programming languages. But it will require some extra processing for data types json doesn't support.
EDIT
As @bereal noticed, there can be a security problem with pickle, because it's effectively an executable language; never unpickle data received from an untrusted source.
Maybe you could insert a separator between the list elements and then call split on it to get the list back.
EDIT:
As @Eric Fortin mentioned in his answer, the separator should be something that cannot occur in your strings. One possibility is, as he suggested, to use encoding. Another possibility is to send the elements one by one, but that would obviously increase the communication overhead.
Note that your separator can be a sequence; it does not need to be a single character.
str_list = separator.join(list_)
new_list = str_list.split(separator)
If you know the format of your list elements, you may even go without separators and use a regular expression to cut the string back apart (re.findall is the straightforward tool for fixed-format tokens like these):
import re
str_list = "".join(list_)
new_list = re.findall(r'[A-Z]\d', str_list)  # pulls back tokens like 'S1' or 'D8'
I am dumping a string of 0s and 1s of length 4807100171 into a pickle file, because I had trouble with bitarray earlier and wanted to see whether pickle could be a solution to my problem. However, after I load it back, it has length 512132875.
Why is that?
I have searched for known pickle limitations, but I haven't found anything... If there is a well-known reason, I might not be using the correct keywords...
Edit:
You can fill a string b with values so that you get a length of 4807100171, using whatever technique you prefer, perhaps something like a simple for loop going to 4807100171. I personally encode the original data using Huffman coding, but that would be a long example that I feel is not really necessary here.
I then dump the string b as follow:
b = ""
for i in range(4807100171)
b += 0
import cPickle as pickle
pickle.dump(b, open("string.p", "wb"), pickle.HIGHEST_PROTOCOL)
This is obviously an integer overflow problem: notice that 4807100171 minus 2**32 is 512132875. Unfortunately, a 32-bit integer is how the binary pickle format represents string lengths. It appears that using the text pickle format (protocol version 0) would avoid this problem, but text pickles are generally longer and would take an absurd amount of memory to handle a string of this size. I haven't actually tested this; I don't think I have enough memory on any of my computers to do so!
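You can reproduce the observed length directly by truncating the size to 32 bits:
>>> 4807100171 - 2**32
512132875
>>> 4807100171 & 0xFFFFFFFF
512132875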
If this one string is the only thing being stored, then it would be far simpler to just write the string itself to a file.
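A minimal sketch of that alternative (the filename is arbitrary):
# write the raw string out...
with open("string.txt", "w") as f:
    f.write(b)
# ...and read it straight back, with no length field in the way
with open("string.txt") as f:
    b = f.read()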