python os.read(fd, n) requires parameter n, why?

I need to read a text file with the os module as such:
import os

t = os.open('te.txt', os.O_RDONLY)
r = os.read(t, 20)
rs = r.decode('utf-8')
print(rs)
What if I don't know the byte size of the file. I could put a very large number instead of 20 as a value seems to be required, but perhaps there is a more pythonic way.

The second argument isn't supposed to hold the size of the file in bytes; it's only supposed to hold the maximum amount of content you're prepared to read at a time (which should typically be divisible by both your operating system's block size and page size; 64 KiB is not a bad default).
The "why" of this is because memory has to be allocated in userspace before the kernel can be instructed to write content into that memory. This isn't the kind of detail that Python developers need to think about often, but you're using a low-level interface built for use from C; it accordingly has implementation details leaking out of that underlying layer.
The operating system is free to give you less than the number of bytes you indicate as a maximum (for example, if it gets interrupted, or the filesystem driver isn't written to provide that much data at a time), so no matter what, you need to be prepared to call it repeatedly; only when it returns an empty bytes object (as opposed to throwing an exception or returning a shorter-than-requested result) are you certain to have reached the end of the file.
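For illustration, a minimal sketch of that read-until-empty loop, assuming an arbitrary 64 KiB chunk size:
import os

fd = os.open('te.txt', os.O_RDONLY)
try:
    chunks = []
    while True:
        chunk = os.read(fd, 64 * 1024)  # at most 64 KiB per call
        if not chunk:                   # empty bytes object: end of file
            break
        chunks.append(chunk)
finally:
    os.close(fd)
print(b''.join(chunks).decode('utf-8'))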
os.read() isn't a Pythonic interface, and it isn't supposed to be. It's a thin wrapper around the syscall provided by the operating system kernel. If you want a Pythonic interface, don't use os.read(), but instead use Python's native file objects.
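For comparison, a minimal sketch of the Pythonic route, using a plain file object on the te.txt file from the question:
with open('te.txt', encoding='utf-8') as f:
    print(f.read())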

If you want to load the whole file and have to use os, you can use os.stat(filename).st_size or os.path.getsize(filename) to get the size of the file in bytes:
import os

filename = 'te.txt'
t = os.open(filename, os.O_RDONLY)
b = os.stat(filename).st_size
r = os.read(t, b)
rs = r.decode('utf-8')
print(rs)
os.close(t)

Related

Python Compressed file ended before the end-of-stream marker was reached. But file is not Corrupted

I made a simple requests script that downloads a file from a server:
r = requests.get("https:.../index_en.txt.lzma")
index_en= open('C:\...\index_en.txt.lzma','wb')
index_en.write(r.content)
index_en.close
When I now extract the file manually in the directory with 7-Zip, everything is fine and the file decompresses as normal.
I tried two ways to do it in a Python program, but since the file ends with .lzma I guess the following one is the better approach:
import lzma
with open('C:\...\index_en.txt.lzma') as compressed:
    print(compressed.readline)
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)
This one gives me the error "Compressed file ended before the end-of-stream marker was reached" at the line with the for loop.
The second way I tried was with py7zr, because extracting by hand with 7-Zip worked fine:
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
This one gives me the error OSError: [Errno 22] Invalid argument at the "with py7zr..." line.
I really don't understand what the problem is here. Why does it work by hand but not in Python?
Thanks
You didn't close your file, so data stuck in user mode buffers isn't visible on disk until the file is cleaned up at some undetermined future point (may not happen at all, and may not happen until the program exits even if it does). Because of this, any attempt to access the file by any means other than the single handle you wrote to will not see the unflushed data, which would cause it to appear as if the file was truncated, getting the error you observe.
The minimal solution is to actually call close, changing index_en.close to index_en.close(). But practically speaking, you should use with statements for all files (and locks, and socket-like things, and all other resources that require cleanup), whenever possible, so even when an exception occurs the file is definitely closed; it's most important for files you're writing to (where data might not get flushed to disk without it), but even for files opened for reading, in pathological cases you can end up hitting the open file handle limit.
Rewriting your first block of code to be completely safe gets you:
with requests.get("https:.../index_en.txt.lzma") as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    index_en.write(r.content)
Note: requests.Response objects are also context managers, so I added it to the with to ensure the underlying connection is released back to the pool promptly. I also prefixed your local path with an r to make it a raw string; on Windows, with backslashes in the path, you always want to do this, so that a file or directory beginning with a character that Python recognizes as a string literal escape doesn't get corrupted (e.g. "C:\foo" is actually "C:<form feed>oo", containing neither a backslash nor an f).
You could even optimize it a bit, in case the file is large, by streaming the data into the file (requiring mostly fixed memory overhead, tied to the buffer size of the underlying connection) rather than fetching eagerly (requiring memory proportionate to file size):
# stream=True means the underlying file is opened without being immediately
# read into memory
with requests.get("https:.../index_en.txt.lzma", stream=True) as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    # iter_content(None) produces an iterator of chunks of data (of whatever size
    # is available in a single system call)
    # Changing to writelines means the iterator is consumed and written
    # as the data arrives
    index_en.writelines(r.iter_content(None))
Controlling the requests.get with a with statement is more important here (as stream=True mode means the underlying socket isn't consumed and freed immediately).
Also note that print(compressed.readline) is doing nothing (because you didn't call readline). If there is some line of text in the response prior to the raw LZMA data, you failed to skip it. If there is not such a garbage line, and if you'd called readline properly (with print(compressed.readline())), it would have broken decompression because the file pointer would now have skipped the first few (or many) bytes of the file, landing at some mostly random offset.
Lastly,
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
is wrong because you passed it a mode indicating you're opening it for write, when you're clearly attempting to read from it; either omit the 'w' or change it to 'r'.
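Once the file has actually been closed and fully written, a minimal sketch of reading it back with the standard-library lzma module (reusing the placeholder path from the question; lzma.open auto-detects both the .xz and legacy .lzma containers when reading):
import lzma

with lzma.open(r'C:\...\index_en.txt.lzma', 'rt', encoding='utf-8') as uncompressed:
    for line in uncompressed:
        print(line, end='')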

Seek on a large text file python

I have a few text files whose sizes range between 5 gigs and 50 gigs. I am using Python to read them. I have specific anchors in terms of byte offsets, to which I can seek and read the corresponding data from each of these files (using Python's file api).
The issue that I am seeing is that for relatively smaller files (< 5 gigs), this reading approach works well. However, for the much larger files (> 20 gigs), and especially when file.seek has to take longer jumps (of several million bytes at a time), it sometimes takes a few hundred milliseconds to do so.
My impression was that seek operations within the files are constant time operations. But apparently, they are not. Is there a way around it?
Here is what I am doing:
import time
f = open(filename, 'r+b')
f.seek(209)
current = f.tell()
t1 = time.time()
next = f.seek(current + 1200000000)
t2 = time.time()
line = f.readline()
delta = t2 - t1
The delta variable varies between a few microseconds and a few hundred milliseconds, intermittently. I also profiled the CPU usage and didn't see anything busy there either.
Your code runs consistently in under 10 microseconds on my system (Windows 10, Python 3.7), so there is no obvious error in your code.
NB: You should use time.perf_counter() instead of time.time() for measuring performance. The granularity of time.time() can be very bad ("not all systems provide time with a better precision than 1 second"). When comparing timings with other systems you may get strange results.
My best guess is that the seek triggers some buffering (read-ahead) action, which might be slow, depending on your system.
According to the documentation:
Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
You could try to disable buffering by adding the argument buffering=0 to open() and check if that makes a difference:
open(filename, 'r+b', buffering=0)
A good way around that could be combining functions from the os module, which are low-level I/O: os.open (with the os.O_RDONLY flag in your case), os.lseek, and os.read.
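A rough sketch of that approach, reusing the offsets from the question (the 4096-byte read size is an arbitrary choice, not a recommendation):
import os

fd = os.open(filename, os.O_RDONLY)
try:
    os.lseek(fd, 209 + 1200000000, os.SEEK_SET)  # absolute seek; no buffered read-ahead
    data = os.read(fd, 4096)                     # reads only what you ask for
finally:
    os.close(fd)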

os.read() gives OSError: [Errno 22] Invalid argument when reading large data

I use the following method to read binary data from any given offset in the binary file. The binary file I have is huge (10 GB), so I usually read a portion of it when needed by specifying from which offset I should start_read and how many bytes to read, num_to_read. I use Python 3.6.4 :: Anaconda, Inc., platform Darwin-17.6.0-x86_64-i386-64bit, and the os module:
import os
import numpy as np

def read_from_disk(path, start_read, num_to_read, dim):
    fd = os.open(path, os.O_RDONLY)
    os.lseek(fd, start_read, 0)  # Where to (start_read) from the beginning 0
    raw_data = os.read(fd, num_to_read)  # How many bytes to read
    C = np.frombuffer(raw_data, dtype=np.int64).reshape(-1, dim).astype(np.int8)
    os.close(fd)
    return C
This method works very well when the chunk of data to be read is less than about 2 GB. When num_to_read > 2 GB, I get this error:
raw_data = os.read(fd, num_to_read) # How many to read (num_to_read)
OSError: [Errno 22] Invalid argument
I am not sure why this issue appears and how to fix it. Any help is highly appreciated.
The os.read function is just a thin wrapper around the platform's read function.
On some platforms, this is an unsigned or signed 32-bit int (see note 1 below), which means the largest you can read in a single go on these platforms is, respectively, 4 GB or 2 GB.
So, if you want to read more than that, and you want to be cross-platform, you have to write code to handle this, and to buffer up multiple reads.
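A hedged sketch of what that could look like; read_fully is a hypothetical helper name, and the 1 GiB chunk size is an assumption chosen to stay under the 2 GB limit:
import os

CHUNK = 1 << 30  # 1 GiB per os.read call, safely below a signed 32-bit count

def read_fully(fd, num_to_read):
    parts = []
    remaining = num_to_read
    while remaining > 0:
        data = os.read(fd, min(remaining, CHUNK))
        if not data:  # end of file before num_to_read bytes were available
            break
        parts.append(data)
        remaining -= len(data)
    return b''.join(parts)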
This may be a bit of a pain, but you are intentionally using the lowest-level directly-mapping-to-the-OS-APIs function here. If you don't like that:
Use io module objects (Python 3.x) or file objects (2.7) that you get back from open instead.
Just let NumPy read the files—which will have the added advantage that NumPy is smart enough to not try to read the whole thing into memory at once in the first place.
Or, for files this large, you may want to go lower level and use mmap (assuming you're on a 64-bit platform).
The right thing to do here is almost certainly a combination of the first two. In Python 3, it would look like this:
with open(path, 'rb', buffering=0) as f:
    f.seek(start_read)
    count = num_to_read // 8  # how many int64s to read
    return np.fromfile(f, dtype=np.int64, count=count).reshape(-1, dim).astype(np.int8)
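If you instead go the mmap route mentioned above, numpy.memmap is one way to sketch it (assuming, as in the question's code, that the file is raw int64 values and start_read is a multiple of 8):
import numpy as np

# The mapping is lazy: only the slice you actually touch is read from disk
arr = np.memmap(path, dtype=np.int64, mode='r')
C = arr[start_read // 8 : (start_read + num_to_read) // 8].reshape(-1, dim).astype(np.int8)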
1. For Windows, the POSIX-emulation library's _read function uses int for the count argument, which is signed 32-bit. For every other modern platform, see POSIX read, and then look up the definitions of size_t, ssize_t, and off_t, on your platform. Notice that many POSIX platforms have separate 64-bit types, and corresponding functions, instead of changing the meaning of the existing types to 64-bit. Python will use the standard types, not the special 64-bit types.

How do I edit an executable with python by address/offset/bytes, like in a hex editor?

I've used Hex-rays IDA to find the bytes of code I need changed in a windows executable. I would like to write a python script that will programmatically edit those bytes.
I know the address (as given in hex-rays IDA) and I know the hexadecimal I wish to overwrite it with. How do I do this in python? I'm sure there is a simple answer, but I can't find it.
(For example: address = 0x00436411, and new hexadecimal = 0xFA)
You just need to open the executable as a file, for writing, in binary mode; then seek to the position you want to write; then write. So:
with open(path, 'r+b') as f:
    f.seek(position)
    f.write(new_bytes)
If you're going to be changing a lot of bytes, you may find it simpler to use mmap, which lets you treat the file as a giant list:
import contextlib
import mmap

with open(path, 'r+b') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)) as m:  # length 0 maps the whole file
        m[first_position] = first_new_byte
        m[other_position] = other_new_byte
        # ...
If you're trying to write multi-byte values (e.g., a 32-bit int), you probably want to use the struct module.
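For example, a small sketch that packs a 32-bit little-endian value before writing it at an offset (the value itself is a placeholder):
import struct

new_bytes = struct.pack('<I', 0x12345678)  # 4 bytes, little-endian, unsigned
with open(path, 'r+b') as f:
    f.seek(position)
    f.write(new_bytes)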
If what you know is an address in memory at runtime, rather than a file position, you have to be able to map that to the right place in the executable file. That may not even be possible (e.g., a memory-mapped region). But if it is, you should be able to find out from the debugger where it's mapped. From inside a debugger, this is easy; from outside, you need to parse the PE header structures and do a lot of complicated logic, and there is no reason to do that.
I believe that when using Hex-Rays IDA as a static disassembler, with all the default settings, the addresses it gives you are the addresses where the code and data segments will be mapped into memory if they aren't forced to relocate. Those are, obviously, not offsets into the file.

Query size of block device file in Python

I have a Python script that reads a file (typically from optical media) marking the unreadable sectors, to allow a re-attempt to read said unreadable sectors on a different optical reader.
I discovered that my script does not work with block devices (e.g. /dev/sr0), in order to create a copy of the contained ISO9660/UDF filesystem, because os.stat().st_size is zero. The algorithm currently needs to know the filesize in advance; I can change that, but the issue (of knowing the block device size) remains, and it's not answered here, so I open this question.
I am aware of the following two related SO questions:
Determine the size of a block device (/proc/partitions, ioctl through ctypes)
how to check file size in python? (about non-special files)
Therefore, I'm asking: in Python, how can I get the file size of a block device file?
The “most clean” (i.e. not dependent on external volumes and most reusable) Python solution I've reached is to open the device file and seek to the end, returning the file offset:
import os

def get_file_size(filename):
    "Get the file size by seeking at end"
    fd = os.open(filename, os.O_RDONLY)
    try:
        return os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)
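For example (the /dev/sr0 path from the question is just an illustration):
print(get_file_size('/dev/sr0'))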
Linux-specific ioctl-based solution:
import fcntl
import struct
device_path = '/dev/sr0'
req = 0x80081272  # BLKGETSIZE64; the result is the size in bytes as an unsigned 64-bit integer (uint64)
buf = b' ' * 8
fmt = 'L'
with open(device_path, 'rb') as dev:
    buf = fcntl.ioctl(dev.fileno(), req, buf)
bytes = struct.unpack(fmt, buf)[0]
print(device_path, 'is about', bytes / (1024 ** 2), 'megabytes')
Other unixes will have different values for req, buf, fmt of course.
In Linux, there is /sys/block/${dev}/size that can be read even without sudo. To get the size of /dev/sdb simply do:
print( 512 * int(open('/sys/block/sdb/size','r').read()) )
See also https://unix.stackexchange.com/a/52219/384116
Another possible solution is
def blockdev_size(path):
"""Return device size in bytes.
"""
with open(path, 'rb') as f:
return f.seek(0, 2) or f.tell()
The or f.tell() part is there for Python 2 portability's sake; file.seek() returns None in Python 2.
Magic constant 2 may be substituted with io.SEEK_END.
Trying to adapt from the other answer:
import fcntl
import struct

c = 0x00001260  # check man ioctl_list, BLKGETSIZE; the result is the device size in 512-byte sectors
with open('/dev/sr0', 'rb') as f:
    buf = fcntl.ioctl(f, c, b'\x00' * 8)
s = struct.unpack('L', buf)[0] * 512  # sectors * 512 = size in bytes
print(s)
I don't have a suitable computer at hand to test this. I'd be curious to know if it works :)
