I have a comprehension question not related to any particular language, but since I am writing in python, I tagged python. I am asked to provide some data in "fixed length, flatfile without separators". It confuses me, since I understand it like:
Input: Column A: date (len6)
Input: Column B: name (len20)
Output: "20170409MYVERYSHORTNAME[space][space][space][space][space]"
"MYVERYSHORTNAME" is only 15 char long, but since it's fixed 20-length output, I am supposed to fill 5 times it with something ? It's not specified.
Why do someone even needs a file without separators? He/she will need to break it down to separated fields anyway, what's the point?
This kind of flat (binary) file is meant to be faster/easier to read by machines, and more memory efficient than the equivalent in a more human friendly representation (eg, JSON, CSV, etc.). For example, the machine can preallocate the appropriate amount of memory before reading the contents.
Nowadays, with the virtually unlimited quantity of RAM and dynamic nature of the languages, nobody uses flat files anymore (unless it is specifically needed).
In Python, in order to deal properly with this kind of binary files, you can for example use the struct module from the standard library:
https://docs.python.org/3.6/library/struct.html#module-struct
Example:
import struct
from datetime import datetime
mydate = datetime.now()
myshortname = "HelloWorld!"
struct.pack("8s20s", mydate.strftime('%Y%m%d').encode(), myshortname.encode())
>>> b'201709HelloWorld!\x00\x00\x00\x00\x00\x00\x00\x00\x00'
Typically, when you see fixed-length files, you're dealing with legacy systems. The AS400, for instance, usually spits out fixed-length files with artificial separators (why, I don't know, but that's what I've seen).
Usually, strings are right-padded with spaces, and numbers are left-padded with 0's (zeros).
This is not absolute.
Related
I'm trying to do some work on a file, the file has various data in it, and I'm pulling it in in string/raw format, and then working on the strings.
I'm trying to make the process multithreaded, so I can work on several chunks at once, but of course the files are quite large, several gigabytes, so memory is an issue.
The processes don't need to modify the input data, so they don't need their own copies. However, I don't know how to make an array of strings as a ctype in Python 2.7.
Currently I have:
import multiprocessing, ctypes
from multiprocessing.sharedctypes import Value, Array
with open('test.txt', 'r') as fin:
rawdata = Array('c', fin.readlines(), lock=False)
But this doesn't work as I'd hoped, it sees the whole thing as one massive char buffer array and fails as it wants a single string object. I need to be able to pull out the original lines and work with them with existing python code that examines the contents of the lines and does some operations, which vary from substring matching, to pulling out integer and float values from the strings for mathematical operations. Is there any sensible way I can achieve this that I'm missing? Perhaps I'm using the wrong item (Array), to push the data to a shared c format?
Do you want your strings to end up as Python strings, or as c-style strings a.k.a. null-terminated character arrays? If you're working with python string processing, then simply reading the file into a non-ctypes python string and using that everywhere is the way to go -- python doesn't copy strings by default, since they're immutable anyway. If you want to use c-style strings, then you will want to allocate a character buffer using ctypes, and use fin.readinto(buffer).
I would like to use re module with streams, but not necessarily file streams, at minimal development cost.
For file streams, there's mmap module that is able to impersonate a string and as such can be used freely with re.
Now I wonder how mmap manages to craft an object that re can further reuse. If I just pass whatever, re protect itself against usage of too incompatible objects with TypeError: expected string or bytes-like object. So I thought I'd create a class that derives from string or bytes and override a few methods such as __getitem__ etc. (this intuitively fits the duck typing philosophy of Python), and make them interact with my original stream. However, this doesn't seem to work at all - my overrides are completely ignored.
Is it possible to create such a "lazy" string in pure Python, without C extensions? If so, how?
A bit of background to disregard alternative solutions:
Can't use mmap (the stream contents are not a file)
Can't dump the whole thing to the HDD (too slow)
Can't load the whole thing to the memory (too large)
Can seek, know the size and compute the content at runtime
Example code that demonstrates bytes resistance to modification:
class FancyWrapper(bytes):
def __init__(self, base_str):
pass #super() isn't called and yet the code below finds abc, aaa and bbb
print(re.findall(b'[abc]{3}', FancyWrapper(b'abc aaa bbb def')))
Well, I found out that it's not possible, not currently.
Python's re module internally operates on the strings in the sense that it scans through a plain C buffer, which requires the object it receives to satisfy these properties:
Their representation must reside in the system memory,
Their representation must be linear, e.g. it cannot contain gaps of any sort,
Their representation must contain the content we're searching in as a whole.
So even if we managed to make re work with something else than bytes or string, we'd have to use mmap-like behavior, i.e. impersonate our content provider as linear region in the system memory.
But the mmap mechanism will work only for files, and in fact, even this is also pretty limited. For example, one can't mmap a large file if one tries to write to it, as per this answer.
Even the regex module, which contains many super duper additions such as (?r), doesn't accommodate for content sources outside string and bytes.
For completeness: does this mean we're screwed and can't scan through large dynamic content with re? Not necessarily. There's a way to do it, if we permit a limit on max match size. The solution is inspired by cfi's comment, and extends it to binary files.
Let n = max match size.
Start a search at position x
While there's content:
Navigate to position x
Read 2*n bytes to scan buffer
Find the first match within scan buffer
If match was found:
Let x = x + match_pos + match_size
Notify about the match_pos and match_size
If match wasn't found:
Let x = x + n
What this accomplishes by using twice as big buffer as the max match size? Imagine the user searches for A{3} and the max match size is set to 3. If we'd read just max match size bytes to the scan buffer and the data at current x contained AABBBA:
This iteration would look at AAB. No match.
The next iteration would move the pointer to x+3.
Now the scan buffer would look like this: BBA. Still no match.
This is obviously bad, and the simple solution is to read twice as many bytes as we jump over, to ensure the anomaly near the scan buffer's tail is resolved.
Note that the short-circuiting on the first match within the scan buffer is supposed to protect against other anomalies such as buffer underscans. It could probably be tweaked to minimize reads for scan buffers that contain multiple matches, but I wanted to avoid further complicating things.
This probably isn't the most performant algorithm made, but is good enough for my use case, so I'm leaving it here.
Goal
Reading in a massive binary file approx size 1.3GB and change certain bits and then writing it back to a separate file (cannot modify original file).
Method
When I read in the binary file it gets stored in a massive string encoded in hex format which is immutable since I am using python.
My algorithm loops through the entire file and stores in a list all the indexes of the string that need to be modified. The catch is that all the indexes in the string need to be modified to the same value. I cannot do this in place due to immutable nature. I cannot convert this into a list of chars because that blows up my memory constraints and takes a hell lot of time. The viable thing to do is to store it in a separate string, but due to the immutable nature I have to make a ton of string objects and keep on concatenating to them.
I used some ideas from https://waymoot.org/home/python_string/ however it doesn't give me a good performance. Any ideas, the goal is to copy an existing super long string exactly into another except for certain placeholders determined by the values in the index List ?
So, to be honest, you shouldn't be reading your file into a string. You shouldn't especially be writing anything but the bytes you actually change.
That is just a waste of resources, since you only seem to be reading linearly through the file, noting the down the places that need to be modified.
On all OSes with some level of mmap support (that is, Unixes, among them Linux, OS X, *BSD and other OSes like Windows), you can use Python's mmap module to just open the file in read/write mode, scan through it and edit it in place, without the need to ever load it to RAM completely and then write it back out. Stupid example, converting all 12-valued bytes by something position-dependent:
Note: this code is mine, and not MIT-licensed. It's for text-enhancement purposes and thus covered by CC-by-SA. Thanks SE for making this stupid statement necessary.
import mmap
with open("infilename", "r") as in_f:
in_view = mmap.mmap(in_f.fileno(), 0) ##length = 0: complete file mapping
length = in_view.size()
with open("outfilename", "w") as out_f
out_view = mmap.mmap(out_f.fileno(), length)
for i in range(length):
if in_view[i] == 12:
out_view[i] = in_view[i] + i % 10
else:
out_view[i] = in_view[i]
What about slicing the string, modify each slice, write it back on the disk before moving on to the next slice? Too intensive for the disk?
I have a file header which I am reading and planning on writing which contains information about the contents; version information, and other string values.
Writing to the file is not too difficult, it seems pretty straightforward:
outfile.write(struct.pack('<s', "myapp-0.0.1"))
However, when I try reading back the header from the file in another method:
header_version = struct.unpack('<s', infile.read(struct.calcsize('s')))
I have the following error thrown:
struct.error: unpack requires a string argument of length 2
How do I fix this error and what exactly is failing?
Writing to the file is not too difficult, it seems pretty straightforward:
Not quite as straightforward as you think. Try looking at what's in the file, or just printing out what you're writing:
>>> struct.pack('<s', 'myapp-0.0.1')
'm'
As the docs explain:
For the 's' format character, the count is interpreted as the size of the string, not a repeat count like for the other format characters; for example, '10s' means a single 10-byte string, while '10c' means 10 characters. If a count is not given, it defaults to 1.
So, how do you deal with this?
Don't use struct if it's not what you want. The main reason to use struct is to interact with C code that dumps C struct objects directly to/from a buffer/file/socket/whatever, or a binary format spec written in a similar style (e.g. IP headers). It's not meant for general serialization of Python data. As Jon Clements points out in a comment, if all you want to store is a string, just write the string as-is. If you want to store something more complex, consider the json module; if you want something even more flexible and powerful, use pickle.
Use fixed-length strings. If part of your file format spec is that the name must always be 255 characters or less, just write '<255s'. Shorter strings will be padded, longer strings will be truncated (you might want to throw in a check for that to raise an exception instead of silently truncating).
Use some in-band or out-of-band means of passing along the length. The most common is a length prefix. (You may be able to use the 'p' or 'P' formats to help, but it really depends on the C layout/binary format you're trying to match; often you have to do something ugly like struct.pack('<h{}s'.format(len(name)), len(name), name).)
As for why your code is failing, there are multiple reasons. First, read(11) isn't guaranteed to read 11 characters. If there's only 1 character in the file, that's all you'll get. Second, you're not actually calling read(11), you're calling read(1), because struct.calcsize('s') returns 1 (for reasons which should be obvious from the above). Third, either your code isn't exactly what you've shown above, or infile's file pointer isn't at the right place, because that code as written will successfully read in the string 'm' and unpack it as 'm'. (I'm assuming Python 2.x here; 3.x will have more problems, but you wouldn't have even gotten that far.)
For your specific use case ("file header… which contains information about the contents; version information, and other string values"), I'd just use write the strings with newline terminators. (If the strings can have embedded newlines, you could backslash-escape them into \n, use C-style or RFC822-style continuations, quote them, etc.)
This has a number of advantages. For one thing, it makes the format trivially human-readable (and human-editable/-debuggable). And, while sometimes that comes with a space tradeoff, a single-character terminator is at least as efficient, possibly more so, than a length-prefix format would be. And, last but certainly not least, it means the code is dead-simple for both generating and parsing headers.
In a later comment you clarify that you also want to write ints, but that doesn't change anything. A 'i' int value will take 4 bytes, but most apps write a lot of small numbers, which only take 1-2 bytes (+1 for a terminator/separator) if you write them as strings. And if you're not writing small numbers, a Python int can easily be too large to fit in a C int—in which case struct will silently overflow and just write the low 32 bits.
As a side project I would like to try to parse binary files (Mach-O files specifically). I know tools exist for this already (otool) so consider this a learning exercise.
The problem I'm hitting is that I don't understand how to convert the binary elements found into a python representation. For example, the Mach-O file format starts with a header which is defined by a C Struct. The first item is a uint_32 'magic number' field. When i do
magic = f.read(4)
I get
b'\xcf\xfa\xed\xfe'
This is starting to make sense to me. It's literally a byte array of 4 bytes. However I want to treat this like a 4-byte int that represents the original magic number. Another example is the numberOfSections field. I just want the number represented by 4-byte field, not an array of literal bytes.
Perhaps I'm thinking about this all wrong. Has anybody worked on anything similar? Do I need to write functions to look these 4-byte byte arrays and shift and combine their values to produce the number I want? Is endienness going to screw me here? Any pointers would be most helpful.
Take a look at the struct module:
In [1]: import struct
In [2]: magic = b'\xcf\xfa\xed\xfe'
In [3]: decoded = struct.unpack('<I', magic)[0]
In [4]: hex(decoded)
Out[4]: '0xfeedfacf'
There's Kaitai Struct project that solves exactly that problem. First, you describe a certain file format using a .ksy spec, then you compile it into a Python library (or, actually, a library in any other major programming language), import it, and, voila, parsing boils down to:
from mach_o import MachO
my_file = MachO.from_file("/path/to/your/file")
my_file.magic # => 0xfeedface
my_file.num_of_sections # => some other integer
my_file.sections # => list of objects that represent sections
They have a growing repository of file format specs. It doesn't have Mach-O file format spec (yet?), but there are complex formats like Java .class or Microsoft's PE executable described there, so I guess it shouldn't be a major problem to write spec for Mach-O format as well.
It is actually better than Construct or Hachoir, because it's compiled (as opposed to interpreted), thus it's faster, and it includes tons of other helpful tools like visualizer or format diagram maker. For example, this is a generated explanation diagram for PE executable format:
I would suggest the Construct module. It offers a very high level interface.