Goal
Read in a massive binary file (approx. 1.3 GB), change certain bits, and then write the result to a separate file (the original file cannot be modified).
Method
When I read in the binary file, it gets stored in one massive hex-encoded string, which is immutable since I am using Python.
My algorithm loops through the entire file and stores in a list all the indexes of the string that need to be modified. The catch is that all those indexes need to be set to the same value. I cannot do this in place because strings are immutable. I cannot convert the string into a list of chars because that blows up my memory constraints and takes a hell of a lot of time. The only viable option seems to be building a separate string, but since strings are immutable I have to create a ton of string objects and keep concatenating them.
I used some ideas from https://waymoot.org/home/python_string/ but the performance is still not good. Any ideas? The goal is to copy an existing super long string exactly into another one, except for certain positions determined by the values in the index list.
So, to be honest, you shouldn't be reading your file into a string. And you especially shouldn't be writing anything but the bytes you actually change.
That is just a waste of resources, since you only seem to be reading linearly through the file, noting down the places that need to be modified.
On all OSes with some level of mmap support (that is, Unixes, among them Linux, OS X, *BSD and other OSes like Windows), you can use Python's mmap module to just open the file in read/write mode, scan through it and edit it in place, without the need to ever load it to RAM completely and then write it back out. Stupid example, replacing every byte whose value is 12 with something position-dependent:
Note: this code is mine, and not MIT-licensed. It's for text-enhancement purposes and thus covered by CC-by-SA. Thanks SE for making this stupid statement necessary.
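import mmap

with open("infilename", "rb") as in_f, open("outfilename", "w+b") as out_f:
    # length = 0: map the complete input file, read-only
    in_view = mmap.mmap(in_f.fileno(), 0, access=mmap.ACCESS_READ)
    length = in_view.size()
    # the output file must already have its final size before it can be mapped
    out_f.truncate(length)
    out_view = mmap.mmap(out_f.fileno(), length)
    for i in range(length):
        # on Python 3, indexing an mmap yields an int
        if in_view[i] == 12:
            out_view[i] = in_view[i] + i % 10
        else:
            out_view[i] = in_view[i]
    out_view.flush()
    out_view.close()
    in_view.close()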
What about slicing the string, modifying each slice, and writing it back to disk before moving on to the next slice? Or is that too intensive for the disk?
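The slice-by-slice idea can be sketched roughly as follows. This is only an illustration under assumptions: the function name copy_with_patches, its parameters and the 1 MiB chunk size are made up here, the file is read as raw bytes rather than as a hex string, and the offsets are assumed to be valid byte positions in the file.

```python
def copy_with_patches(src_path, dst_path, offsets, new_byte, chunk_size=1 << 20):
    """Copy src to dst chunk by chunk, setting the bytes at `offsets` to `new_byte`.

    Hypothetical helper: `offsets` are byte positions in the source file and
    `new_byte` is an int in range(256). The source file is never modified.
    """
    offsets = sorted(offsets)
    k = 0      # index of the next offset that still has to be patched
    base = 0   # file position of the first byte of the current chunk
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = bytearray(src.read(chunk_size))  # mutable copy of one slice
            if not chunk:
                break
            # patch every offset that falls inside this chunk
            while k < len(offsets) and offsets[k] < base + len(chunk):
                chunk[offsets[k] - base] = new_byte
                k += 1
            dst.write(chunk)
            base += len(chunk)
```

Only one chunk is held in memory at a time, and each chunk is written exactly once, so the disk sees one sequential read and one sequential write.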
Related
I'm trying to do some work on a file, the file has various data in it, and I'm pulling it in in string/raw format, and then working on the strings.
I'm trying to make the process multithreaded, so I can work on several chunks at once, but of course the files are quite large, several gigabytes, so memory is an issue.
The processes don't need to modify the input data, so they don't need their own copies. However, I don't know how to make an array of strings as a ctype in Python 2.7.
Currently I have:
import multiprocessing, ctypes
from multiprocessing.sharedctypes import Value, Array
with open('test.txt', 'r') as fin:
    rawdata = Array('c', fin.readlines(), lock=False)
But this doesn't work as I'd hoped, it sees the whole thing as one massive char buffer array and fails as it wants a single string object. I need to be able to pull out the original lines and work with them with existing python code that examines the contents of the lines and does some operations, which vary from substring matching, to pulling out integer and float values from the strings for mathematical operations. Is there any sensible way I can achieve this that I'm missing? Perhaps I'm using the wrong item (Array), to push the data to a shared c format?
Do you want your strings to end up as Python strings, or as c-style strings a.k.a. null-terminated character arrays? If you're working with python string processing, then simply reading the file into a non-ctypes python string and using that everywhere is the way to go -- python doesn't copy strings by default, since they're immutable anyway. If you want to use c-style strings, then you will want to allocate a character buffer using ctypes, and use fin.readinto(buffer).
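For the C-style route, a minimal sketch could look like the following. The helper name load_into_c_buffer is made up for illustration, and the file is opened in binary mode so that readinto is available.

```python
import ctypes
import os

def load_into_c_buffer(path):
    """Read a whole file into a ctypes char buffer; return (buffer, bytes_read)."""
    size = os.path.getsize(path)
    buf = ctypes.create_string_buffer(size)  # zero-filled array of `size` c_char
    with open(path, "rb") as fin:
        n = fin.readinto(buf)                # fills the buffer without an extra copy
    return buf, n
```

The lines can then be recovered with buf.raw.splitlines(), although that does create ordinary Python bytes objects again.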
I have a comprehension question not related to any particular language, but since I am writing in python, I tagged python. I am asked to provide some data in "fixed length, flatfile without separators". It confuses me, since I understand it like:
Input: Column A: date (len6)
Input: Column B: name (len20)
Output: "20170409MYVERYSHORTNAME[space][space][space][space][space]"
"MYVERYSHORTNAME" is only 15 chars long, but since the output is fixed at length 20, am I supposed to fill the remaining 5 positions with something? It's not specified.
Why would anyone even need a file without separators? They will have to break it down into separate fields anyway, so what's the point?
This kind of flat (binary) file is meant to be faster/easier to read by machines, and more memory efficient than the equivalent in a more human friendly representation (eg, JSON, CSV, etc.). For example, the machine can preallocate the appropriate amount of memory before reading the contents.
Nowadays, with virtually unlimited RAM and the dynamic nature of modern languages, flat files are rarely used unless they are specifically required.
In Python, in order to deal properly with this kind of binary file, you can for example use the struct module from the standard library:
https://docs.python.org/3.6/library/struct.html#module-struct
Example:
import struct
from datetime import datetime
mydate = datetime.now()
myshortname = "HelloWorld!"
packed = struct.pack("8s20s", mydate.strftime('%Y%m%d').encode(), myshortname.encode())
print(packed)
# Run on 2017-09-01, for example, this prints:
# b'20170901HelloWorld!\x00\x00\x00\x00\x00\x00\x00\x00\x00'
Typically, when you see fixed-length files, you're dealing with legacy systems. The AS400, for instance, usually spits out fixed-length files with artificial separators (why, I don't know, but that's what I've seen).
Usually, strings are right-padded with spaces, and numbers are left-padded with 0's (zeros).
This is not absolute.
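As a rough sketch of that padding convention (the record layout here, including the 10-digit amount field, is hypothetical and only for illustration):

```python
def make_record(date, name, amount):
    """Build one fixed-width record: 8-char date, 20-char name, 10-digit amount."""
    # Assumed convention: text is right-padded with spaces (and truncated if too
    # long), numbers are left-padded with zeros.
    return date.ljust(8) + name.ljust(20)[:20] + str(amount).zfill(10)

def parse_record(line):
    """Split a record back into its fields by position; no separators needed."""
    return line[0:8], line[8:28].rstrip(), int(line[28:38])

record = make_record("20170409", "MYVERYSHORTNAME", 1250)
fields = parse_record(record)
```

Because every field sits at a known offset, the reader slices by position instead of searching for separators.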
I would like to use re module with streams, but not necessarily file streams, at minimal development cost.
For file streams, there's mmap module that is able to impersonate a string and as such can be used freely with re.
Now I wonder how mmap manages to craft an object that re can use. If I just pass in an arbitrary object, re protects itself against incompatible types with TypeError: expected string or bytes-like object. So I thought I'd create a class that derives from string or bytes and override a few methods such as __getitem__ etc. (this intuitively fits the duck typing philosophy of Python), and make them interact with my original stream. However, this doesn't seem to work at all - my overrides are completely ignored.
Is it possible to create such a "lazy" string in pure Python, without C extensions? If so, how?
A bit of background to disregard alternative solutions:
Can't use mmap (the stream contents are not a file)
Can't dump the whole thing to the HDD (too slow)
Can't load the whole thing to the memory (too large)
Can seek, know the size and compute the content at runtime
Example code that demonstrates how bytes resists modification:
import re

class FancyWrapper(bytes):
    def __init__(self, base_str):
        pass  # super() isn't called and yet the code below finds abc, aaa and bbb

print(re.findall(b'[abc]{3}', FancyWrapper(b'abc aaa bbb def')))
Well, I found out that it's not possible, not currently.
Python's re module internally operates on the strings in the sense that it scans through a plain C buffer, which requires the object it receives to satisfy these properties:
Their representation must reside in the system memory,
Their representation must be linear, e.g. it cannot contain gaps of any sort,
Their representation must contain the content we're searching in as a whole.
So even if we managed to make re work with something other than bytes or string, we'd have to use mmap-like behavior, i.e. present our content provider as a linear region in system memory.
But the mmap mechanism will work only for files, and even that is pretty limited. For example, one can't mmap a large file if one tries to write to it, as per this answer.
Even the regex module, which contains many super duper additions such as (?r), doesn't accommodate content sources other than string and bytes.
For completeness: does this mean we're screwed and can't scan through large dynamic content with re? Not necessarily. There's a way to do it, if we permit a limit on max match size. The solution is inspired by cfi's comment, and extends it to binary files.
Let n = max match size.
Start a search at position x.
While there's content:
    Navigate to position x
    Read 2*n bytes into the scan buffer
    Find the first match within the scan buffer
    If a match was found:
        Let x = x + match_pos + match_size
        Notify about match_pos and match_size
    If no match was found:
        Let x = x + n
What does using a buffer twice as big as the max match size accomplish? Imagine the user searches for A{3} and the max match size is set to 3. If we read just max match size bytes into the scan buffer and the data at the current x contained BAAABB:
This iteration would look at BAA. No match.
The next iteration would move the pointer to x+3.
Now the scan buffer would look like this: ABB. Still no match, even though AAA sits at x+1.
This is obviously bad, and the simple solution is to read twice as many bytes as we jump over, to ensure the anomaly near the scan buffer's tail is resolved.
Note that the short-circuiting on the first match within the scan buffer is supposed to protect against other anomalies such as buffer underscans. It could probably be tweaked to minimize reads for scan buffers that contain multiple matches, but I wanted to avoid further complicating things.
This probably isn't the most performant algorithm made, but is good enough for my use case, so I'm leaving it here.
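The steps above can be sketched in Python roughly like this. It is a minimal illustration, not a tuned implementation: the name scan_stream is made up, the stream is assumed to be a seekable binary file-like object, and the pattern is assumed never to match the empty string. As in the steps, the first match anywhere in the buffer is taken, so a greedy variable-length pattern that starts near the buffer's tail may be reported shorter than it really is.

```python
import io
import re

def scan_stream(stream, pattern, max_match):
    """Yield (position, size) for each match of a bytes pattern in a seekable stream.

    Hypothetical helper: reads a window of 2 * max_match bytes at a time.
    """
    regex = re.compile(pattern)
    x = 0
    while True:
        stream.seek(x)
        buf = stream.read(2 * max_match)
        if not buf:
            break                      # no content left
        m = regex.search(buf)
        if m and m.end() > m.start():
            yield x + m.start(), m.end() - m.start()
            x += m.end()               # continue right after the match
        else:
            x += max_match             # no (non-empty) match: advance half a window

data = io.BytesIO(b"AABBBAAAxxAAA")
matches = list(scan_stream(data, rb"A{3}", max_match=3))
```

Here io.BytesIO only stands in for the real content provider; anything that offers seek and read would do.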
I have a file header that I am reading and planning to write; it contains information about the contents: version information and other string values.
Writing to the file is not too difficult, it seems pretty straightforward:
outfile.write(struct.pack('<s', "myapp-0.0.1"))
However, when I try reading back the header from the file in another method:
header_version = struct.unpack('<s', infile.read(struct.calcsize('s')))
I have the following error thrown:
struct.error: unpack requires a string argument of length 2
How do I fix this error and what exactly is failing?
"Writing to the file is not too difficult, it seems pretty straightforward:"
Not quite as straightforward as you think. Try looking at what's in the file, or just printing out what you're writing:
>>> struct.pack('<s', 'myapp-0.0.1')
'm'
As the docs explain:
For the 's' format character, the count is interpreted as the size of the string, not a repeat count like for the other format characters; for example, '10s' means a single 10-byte string, while '10c' means 10 characters. If a count is not given, it defaults to 1.
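A quick way to see the effect of the count, written here for Python 3 (where the argument must be bytes) rather than the Python 2 session above; the variable names are only for illustration:

```python
import struct

name = b"myapp-0.0.1"

one = struct.pack("<s", name)      # no count: a 1-byte string, only b"m" is kept
full = struct.pack("<11s", name)   # count 11: the whole 11-byte string

size_default = struct.calcsize("<s")   # size of a default 's' field
size_full = struct.calcsize("<11s")    # size of an 11-byte 's' field
```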
So, how do you deal with this?
Don't use struct if it's not what you want. The main reason to use struct is to interact with C code that dumps C struct objects directly to/from a buffer/file/socket/whatever, or a binary format spec written in a similar style (e.g. IP headers). It's not meant for general serialization of Python data. As Jon Clements points out in a comment, if all you want to store is a string, just write the string as-is. If you want to store something more complex, consider the json module; if you want something even more flexible and powerful, use pickle.
Use fixed-length strings. If part of your file format spec is that the name must always be 255 characters or less, just write '<255s'. Shorter strings will be padded, longer strings will be truncated (you might want to throw in a check for that to raise an exception instead of silently truncating).
Use some in-band or out-of-band means of passing along the length. The most common is a length prefix. (You may be able to use the 'p' format, a Pascal-style string with a one-byte length prefix, to help, but it really depends on the C layout/binary format you're trying to match; often you have to do something ugly like struct.pack('<h{}s'.format(len(name)), len(name), name).)
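The length-prefix option could look something like this in Python 3. It is a sketch under assumptions: the helper names write_string and read_string are made up, the prefix is an unsigned 16-bit little-endian integer, and strings are stored as UTF-8.

```python
import io
import struct

def write_string(f, s):
    """Write a 2-byte little-endian length followed by the UTF-8 bytes of s."""
    data = s.encode("utf-8")
    # '<H' holds 0..65535; struct.error is raised if the string is longer
    f.write(struct.pack("<H", len(data)))
    f.write(data)

def read_string(f):
    """Read back one string written by write_string."""
    prefix = f.read(2)
    if len(prefix) != 2:
        raise EOFError("missing length prefix")
    (length,) = struct.unpack("<H", prefix)
    data = f.read(length)
    if len(data) != length:   # read() may return fewer bytes than requested
        raise EOFError("truncated string")
    return data.decode("utf-8")

buf = io.BytesIO()
write_string(buf, "myapp-0.0.1")
buf.seek(0)
version = read_string(buf)
```

The reader never has to guess the size: it reads the fixed-size prefix first and then exactly that many bytes.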
As for why your code is failing, there are multiple reasons. First, read(11) isn't guaranteed to read 11 characters. If there's only 1 character in the file, that's all you'll get. Second, you're not actually calling read(11), you're calling read(1), because struct.calcsize('s') returns 1 (for reasons which should be obvious from the above). Third, either your code isn't exactly what you've shown above, or infile's file pointer isn't at the right place, because that code as written will successfully read in the string 'm' and unpack it as 'm'. (I'm assuming Python 2.x here; 3.x will have more problems, but you wouldn't have even gotten that far.)
For your specific use case ("file header… which contains information about the contents; version information, and other string values"), I'd just write the strings with newline terminators. (If the strings can have embedded newlines, you could backslash-escape them into \n, use C-style or RFC822-style continuations, quote them, etc.)
This has a number of advantages. For one thing, it makes the format trivially human-readable (and human-editable/-debuggable). And, while sometimes that comes with a space tradeoff, a single-character terminator is at least as efficient, possibly more so, than a length-prefix format would be. And, last but certainly not least, it means the code is dead-simple for both generating and parsing headers.
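A minimal sketch of such a newline-terminated header, in Python 3. The helper names and the example field values are made up, and embedded newlines are simply rejected here instead of escaped.

```python
import io

def write_header(f, fields):
    """Write each header field as UTF-8 text followed by a newline."""
    for value in fields:
        text = str(value)
        if "\n" in text:
            raise ValueError("embedded newlines would need escaping")
        f.write(text.encode("utf-8") + b"\n")

def read_header(f, count):
    """Read `count` newline-terminated fields back as strings."""
    return [f.readline().rstrip(b"\n").decode("utf-8") for _ in range(count)]

buf = io.BytesIO()
write_header(buf, ["myapp-0.0.1", "some title", 42])
buf.write(b"\x00\x01binary payload")   # the rest of the file follows the header
buf.seek(0)
header = read_header(buf, 3)
rest = buf.read()
```

After read_header returns, the file position is at the first byte after the header, so the remaining contents can be read as usual.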
In a later comment you clarify that you also want to write ints, but that doesn't change anything. An 'i' int value will take 4 bytes, but most apps write a lot of small numbers, which only take 1-2 bytes (+1 for a terminator/separator) if you write them as strings. And if you're not writing small numbers, a Python int can easily be too large to fit in a C int, in which case struct will either raise struct.error or, on older Python versions, silently overflow and just write the low 32 bits.
I am dumping a string of 0s and 1s of length 4807100171 into a pickle file because I had previous trouble with bitarray and wanted to see if pickle could be a solution to my problem. However, after I load it, it now is of length 512132875.
Why is that?
I have searched to see if there are any limitations in pickle, but I haven't found anything... If there is a well-known reason, I might not be using the correct keywords...
Edit:
You can fill a string b of random values so you get a length of 4807100171 with the technique you prefer - perhaps something like a simple for loop going to 4807100171. I personally encrypt original data using Huffman coding but it would be a long example that I feel is not really necessary here.
I then dump the string b as follows:
import cPickle as pickle

b = ""
for i in xrange(4807100171):  # xrange: range() would build a huge list in Python 2
    b += "0"
pickle.dump(b, open("string.p", "wb"), pickle.HIGHEST_PROTOCOL)
This is obviously an integer overflow problem - notice that 4807100171 minus 2**32 is 512132875. Unfortunately, a 32-bit integer is how the binary pickle format represents string lengths. It appears that using the text pickle format (protocol version 0) would avoid this problem, but text pickles are generally longer, and would take an absurd amount of memory to handle a string of this size. I haven't actually tested this - I don't think I have enough memory on any of my computers to do so!
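The arithmetic is easy to check without allocating the string. The snippet below only illustrates 32-bit wrap-around in general; it does not reproduce pickle's actual encoding, and the variable names are made up.

```python
import struct

true_length = 4807100171

# What survives when the length is squeezed into a 32-bit field
wrapped = true_length % 2**32

# Equivalent view: keep only the low 4 bytes of the 8-byte little-endian encoding
(low32,) = struct.unpack("<I", struct.pack("<Q", true_length)[:4])
```

Both values equal the length observed after loading.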
If this one string is the only thing being stored, then it would be far simpler to just write the string itself to a file.