Query size of block device file in Python

I have a Python script that reads a file (typically from optical media) and marks the unreadable sectors, so that a re-read of those sectors can be attempted on a different optical reader.
I discovered that my script does not work with block devices (e.g. /dev/sr0) when trying to create a copy of the contained ISO9660/UDF filesystem, because os.stat().st_size is zero for them. The algorithm currently needs to know the file size in advance; I can change that, but the underlying issue (knowing the size of a block device) remains, and it isn't answered by the related questions below, so I'm opening this one.
I am aware of the following two related SO questions:
Determine the size of a block device (/proc/partitions, ioctl through ctypes)
how to check file size in python? (about non-special files)
Therefore, I'm asking: in Python, how can I get the file size of a block device file?

The cleanest (i.e. most self-contained and reusable) Python solution I've reached is to open the device file and seek to the end, returning the file offset:
import os

def get_file_size(filename):
    """Get the file size by seeking to the end."""
    fd = os.open(filename, os.O_RDONLY)
    try:
        return os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)

Linux-specific ioctl-based solution:
import fcntl
import struct

device_path = '/dev/sr0'
req = 0x80081272  # BLKGETSIZE64, result is bytes as unsigned 64-bit integer (uint64)
buf = b' ' * 8
fmt = 'L'

with open(device_path, 'rb') as dev:
    buf = fcntl.ioctl(dev.fileno(), req, buf)
size = struct.unpack(fmt, buf)[0]
print(device_path, 'is about', size // (1024 ** 2), 'megabytes')
Other unixes will have different values for req, buf, fmt of course.

In Linux, there is /sys/block/${dev}/size that can be read even without sudo. To get the size of /dev/sdb simply do:
print( 512 * int(open('/sys/block/sdb/size','r').read()) )
See also https://unix.stackexchange.com/a/52219/384116
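A minimal sketch wrapping this into a reusable helper (the function name is my own; the sysfs size attribute is expressed in 512-byte sectors):
import os

def blockdev_size_sysfs(dev):
    """Size in bytes of a whole block device, e.g. blockdev_size_sysfs('sr0')."""
    with open(os.path.join('/sys/block', dev, 'size')) as f:
        return int(f.read()) * 512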

Another possible solution is:
def blockdev_size(path):
    """Return device size in bytes."""
    with open(path, 'rb') as f:
        return f.seek(0, 2) or f.tell()
The or f.tell() part is there for Python 2 portability's sake: file.seek() returns None in Python 2.
Magic constant 2 may be substituted with io.SEEK_END.

Trying to adapt from the other answer:
import fcntl
import struct

BLKGETSIZE = 0x00001260  # check man ioctl_list; returns the size in 512-byte sectors
with open('/dev/sr0', 'rb') as f:
    buf = fcntl.ioctl(f.fileno(), BLKGETSIZE, struct.pack('L', 0))
    print(struct.unpack('L', buf)[0] * 512)
I don't have a suitable computer at hand to test this. I'd be curious to know if it works :)

Related

python os.read(fd, n) requires parameter n, why?

I need to read a text file with the os module, like this:
import os
t = os.open('te.txt', os.O_RDONLY)
r = os.read(t, 20)
rs = r.decode('utf-8')
print(rs)
What if I don't know the byte size of the file? I could put a very large number instead of 20, since a value seems to be required, but perhaps there is a more Pythonic way.
The second argument isn't supposed to hold the size of the file in bytes; it's only the maximum amount of content you're prepared to read at a time (which should typically be divisible by both your operating system's block size and page size; 64 KiB is not a bad default).
The "why" of this is because memory has to be allocated in userspace before the kernel can be instructed to write content into that memory. This isn't the kind of detail that Python developers need to think about often, but you're using a low-level interface built for use from C; it accordingly has implementation details leaking out of that underlying layer.
The operating system is free to give you less than the number of bytes you indicate as a maximum (for example, if it gets interrupted, or the filesystem driver isn't written to provide that much data at a time), so no matter what, you need to be prepared to call it repeatedly; only when it returns an empty string (as opposed to throwing an exception or returning a shorter-than-requested string) are you certain to have reached the end of the file.
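A minimal sketch of such a read loop, reading until os.read() returns an empty bytes object (the buffer size and file name are just examples):
import os

fd = os.open('te.txt', os.O_RDONLY)
try:
    chunks = []
    while True:
        chunk = os.read(fd, 65536)  # ask for at most 64 KiB per call
        if not chunk:               # empty result means end of file
            break
        chunks.append(chunk)
finally:
    os.close(fd)
print(b''.join(chunks).decode('utf-8'))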
os.read() isn't a Pythonic interface, and it isn't supposed to be. It's a thin wrapper around the syscall provided by the operating system kernel. If you want a Pythonic interface, don't use os.read(), but instead use Python's native file objects.
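For comparison, a file-object equivalent of the snippet in the question might look like this (assuming the file is UTF-8):
# let the file object handle buffering and decoding
with open('te.txt', 'r', encoding='utf-8') as f:
    print(f.read())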
If you wanted to load the whole file and you have to use os, you could use os.stat(filename).st_size or os.path.getsize(filename) to get the size of the file in bytes.
import os

filename = 'te.txt'
t = os.open(filename, os.O_RDONLY)
b = os.stat(filename).st_size  # size of the file in bytes
r = os.read(t, b)
os.close(t)
rs = r.decode('utf-8')
print(rs)

os.read() gives OSError: [Errno 22] Invalid argument when reading large data

I use the following method to read binary data from any given offset in a binary file. The binary file I have is huge (10 GB), so I usually read a portion of it when needed, by specifying from which offset I should start_read and how many bytes to read (num_to_read). I use Python 3.6.4 :: Anaconda, Inc., platform Darwin-17.6.0-x86_64-i386-64bit, and the os module:
import os
import numpy as np

def read_from_disk(path, start_read, num_to_read, dim):
    fd = os.open(path, os.O_RDONLY)
    os.lseek(fd, start_read, 0)          # seek to start_read from the beginning (0)
    raw_data = os.read(fd, num_to_read)  # how many bytes to read
    C = np.frombuffer(raw_data, dtype=np.int64).reshape(-1, dim).astype(np.int8)
    os.close(fd)
    return C
This method works very well when the chunk of data to be read is less than about 2 GB. When num_to_read > 2 GB, I get this error:
raw_data = os.read(fd, num_to_read) # How many to read (num_to_read)
OSError: [Errno 22] Invalid argument
I am not sure why this issue appears and how to fix it. Any help is highly appreciated.
The os.read function is just a thin wrapper around the platform's read function.
On some platforms, the count argument of read is an unsigned or signed 32-bit int,1 which means the largest amount you can read in a single call on those platforms is, respectively, 4GB or 2GB.
So, if you want to read more than that, and you want to be cross-platform, you have to write code to handle this, and to buffer up multiple reads.
This may be a bit of a pain, but you are intentionally using the lowest-level directly-mapping-to-the-OS-APIs function here. If you don't like that:
Use io module objects (Python 3.x) or file objects (2.7) that you get back from open instead.
Just let NumPy read the files—which will have the added advantage that NumPy is smart enough to not try to read the whole thing into memory at once in the first place.
Or, for files this large, you may want to go lower level and use mmap (assuming you're on a 64-bit platform).
The right thing to do here is almost certainly a combination of the first two. In Python 3, it would look like this:
import numpy as np

def read_from_disk(path, start_read, num_to_read, dim):
    with open(path, 'rb', buffering=0) as f:
        f.seek(start_read)
        count = num_to_read // 8  # how many int64s to read
        return np.fromfile(f, dtype=np.int64, count=count).reshape(-1, dim).astype(np.int8)
1. For Windows, the POSIX-emulation library's _read function uses int for the count argument, which is signed 32-bit. For every other modern platform, see POSIX read, and then look up the definitions of size_t, ssize_t, and off_t, on your platform. Notice that many POSIX platforms have separate 64-bit types, and corresponding functions, instead of changing the meaning of the existing types to 64-bit. Python will use the standard types, not the special 64-bit types.
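If you do need to stay with os.read, a minimal sketch of the "buffer up multiple reads" approach mentioned above (the helper name and the 1 GiB per-call cap are my own choices):
import os

def read_exact(fd, num_to_read, max_per_call=1 << 30):
    """Read num_to_read bytes from fd using as many os.read() calls as needed."""
    parts = []
    remaining = num_to_read
    while remaining > 0:
        part = os.read(fd, min(remaining, max_per_call))
        if not part:  # end of file reached early
            break
        parts.append(part)
        remaining -= len(part)
    return b''.join(parts)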

Calculate a CRC / CRC32 hash / checksum on a binary file in Python using a buffer

I've been trying to teach myself Python, so I don't fully understand what I'm doing. I'm embarrassed to say this, but my question should be really easy to answer. I want to be able to compute CRC checksums on binary files with code similar to this:
# http://upload.wikimedia.org/wikipedia/commons/7/72/Pleiades_Spitzer_big.jpg
import zlib
buffersize = 65536
with open('Pleiades_Spitzer_big.jpg', 'rb') as afile:
    buffr = afile.read(buffersize)
    while len(buffr) > 0:
        crcvalue = zlib.crc32(buffr)
        buffr = afile.read(buffersize)
print(format(crcvalue & 0xFFFFFFFF, '08x'))
The correct result should be "a509ae4b", but my code's result is "dedf5161". I think what is happening is that the checksum is being calculated on either the first or last 64 KiB of the file instead of the whole file.
How should the code be altered so it checks the entire file without loading the entire file into memory?
As it is, the code "works" in either Python 2.x or 3.x. If the code has to be in one or the other, I'd prefer it to be in 3.x.
You're currently calculating the CRC of only the last chunk of the file. To fix this, pass the current crcvalue to crc32 as the starting value:
import zlib

buffersize = 65536
with open('Pleiades_Spitzer_big.jpg', 'rb') as afile:
    buffr = afile.read(buffersize)
    crcvalue = 0
    while len(buffr) > 0:
        crcvalue = zlib.crc32(buffr, crcvalue)
        buffr = afile.read(buffersize)
print(format(crcvalue & 0xFFFFFFFF, '08x'))  # a509ae4b
Here's the relevant part from Python docs:
If value is present, it is used as the starting value of the checksum; otherwise, a default value of 0 is used. Passing in value allows computing a running checksum over the concatenation of several inputs.
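A quick illustration of that running-checksum property (the sample bytes are just an example):
import zlib

data = b'binary file contents'
whole = zlib.crc32(data)
# CRC of the first piece is passed as the starting value for the second piece
running = zlib.crc32(data[8:], zlib.crc32(data[:8]))
assert running == whole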
While the accepted answer by @niemmi is excellent and accurate, here is a Python 3.8+ compatible solution which simplifies the code a bit.
Python 3.8+
The sample below makes use of the walrus assignment operator ( := ) to keep track of the chunks being read:
import zlib

size = 1024 * 1024 * 10  # 10 MiB chunks
with open('/tmp/test.txt', 'rb') as f:
    crcval = 0
    while chunk := f.read(size):
        crcval = zlib.crc32(chunk, crcval)
print(f'{crcval & 0xFFFFFFFF:08x}')
Testing:
$ echo "Some boring example text in a file." > /tmp/test.txt
$ crc32 /tmp/test.txt
2a30366b
Checksum value using the example code above:
2a30e66b

MPD, FIFO, Python, Audioop, Arduino, and Voltmeter: "Faking" a VU Meter

I'm trying to use a computer connected to an Arduino (which is itself connected to some 5V voltmeters) to "fake" an old school stereo VU meter. My goal is to have the computer that is playing the audio file analyze the signal and send the amplitude information to the Arduino via a serial connection, to be displayed on the voltmeters.
I'm using MPD to render and send the audio to a USB DAC (ODAC). MPD is also outputting to a FIFO, which I read from using a Python script. I read from the FIFO in 4096 byte chunks, then use the audioop library to split that chunk/sample into a left and right channel and compute the maximum amplitude of each channel.
Here's the problem - I'm getting swamped with data. I'm guessing my math is wrong or that I don't understand how a FIFO works (or maybe both). MPD is outputting everything in 44100:16:2 format - I thought that meant that it would be writing out 44,100 4-byte samples per second. So if I'm grabbing 4096 byte chunks, I should expect about 43 chunks per second. But I'm getting far more than that (over 100) and the number of chunks I get per second doesn't change if I up my chunk size. For example, if I double my chunk size to 8192, I still get roughly the same number of chunks per second. So clearly I'm doing something wrong, but I don't know what it is. Anyone have any thoughts?
Here is the relevant portion of my mpd.conf file:
audio_output {
    type   "fifo"
    name   "my_fifo"
    path   "/tmp/mpd.fifo"
    format "44100:16:2"
}
And here is the Python script:
import os
import audioop
import time
import errno
import math

# Open the FIFO that MPD has created for us
# This represents the sample (44100:16:2) that MPD is currently "playing"
fifo = os.open('/tmp/mpd.fifo', os.O_RDONLY)

while 1:
    try:
        rawStream = os.read(fifo, 4096)
    except OSError as err:
        if err.errno == errno.EAGAIN or err.errno == errno.EWOULDBLOCK:
            rawStream = None
        else:
            raise

    if rawStream:
        leftChannel = audioop.tomono(rawStream, 2, 1, 0)
        rightChannel = audioop.tomono(rawStream, 2, 0, 1)
        stereoPeak = audioop.max(rawStream, 2)
        leftPeak = audioop.max(leftChannel, 2)
        rightPeak = audioop.max(rightChannel, 2)
        leftDB = 20 * math.log10(leftPeak) - 74
        rightDB = 20 * math.log10(rightPeak) - 74
        print(rightPeak, leftPeak, rightDB, leftDB)
Answering my own question. It turns out that, regardless of how many bytes I asked for, os.read() was returning 2048 bytes. This means the second parameter os.read() takes is only the maximum number of bytes it will read; there's no guarantee that many bytes will actually be returned. I had thought that, by leaving out the NONBLOCK option when opening the FIFO, the os.read() call would wait until it received an end of file or the number of bytes specified, but that's not the case. To get around this, my code now checks the length of the byte string returned by os.read() and, if that length is less than my specified chunk size, waits to grab the next chunk(s) and concatenates the chunks together, so that I have a chunk size matching my target before moving on to processing the data.
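A minimal sketch of that accumulate-until-full approach (the helper name is mine, not from the original script):
import os

def read_chunk(fifo_fd, chunk_size):
    """Keep calling os.read() until chunk_size bytes have been collected."""
    data = b''
    while len(data) < chunk_size:
        part = os.read(fifo_fd, chunk_size - len(data))
        if not part:  # the writer closed the FIFO
            break
        data += part
    return data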

zlib decompression in python

Okay so I have some data streams compressed by python's (2.6) zlib.compress() function. When I try to decompress them, some of them won't decompress (zlib error -5, which seems to be a "buffer error", no idea what to make of that). At first, I thought I was done, but I realized that all the ones I couldn't decompress started with 0x78DA (the working ones were 0x789C), and I looked around and it seems to be a different kind of zlib compression -- the magic number changes depending on the compression used. What can I use to decompress the files? Am I hosed?
According to RFC 1950, the difference between the "OK" 0x789C and the "bad" 0x78DA is in the FLEVEL bit-field:
FLEVEL (Compression level)
These flags are available for use by specific compression
methods. The "deflate" method (CM = 8) sets these flags as
follows:
0 - compressor used fastest algorithm
1 - compressor used fast algorithm
2 - compressor used default algorithm
3 - compressor used maximum compression, slowest algorithm
The information in FLEVEL is not needed for decompression; it
is there to indicate if recompression might be worthwhile.
"OK" uses 2, "bad" uses 3. So that difference in itself is not a problem.
To get any further, you might consider supplying the following information for each of compressing and (attempted) decompressing: what platform, what version of Python, what version of the zlib library, what was the actual code used to call the zlib module. Also supply the full traceback and error message from the failing decompression attempts. Have you tried to decompress the failing files with any other zlib-reading software? With what results? Please clarify what you have to work with: Does "Am I hosed?" mean that you don't have access to the original data? How did it get from a stream to a file? What guarantee do you have that the data was not mangled in transmission?
UPDATE Some observations based on partial clarifications published in your self-answer:
You are using Windows. Windows distinguishes between binary mode and text mode when reading and writing files. When reading in text mode, Python 2.x changes '\r\n' to '\n', and changes '\n' to '\r\n' when writing. This is not a good idea when dealing with non-text data. Worse, when reading in text mode, '\x1a' aka Ctrl-Z is treated as end-of-file.
To compress a file:
# imports and other superstructure left as an exercise
str_object1 = open('my_log_file', 'rb').read()
str_object2 = zlib.compress(str_object1, 9)
f = open('compressed_file', 'wb')
f.write(str_object2)
f.close()
To decompress a file:
str_object1 = open('compressed_file', 'rb').read()
str_object2 = zlib.decompress(str_object1)
f = open('my_recovered_log_file', 'wb')
f.write(str_object2)
f.close()
Aside: it's better to use the gzip module, which saves you having to think about nasties like text mode, at the cost of a few bytes of extra header info.
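A minimal sketch of that gzip-module approach, in the same style as the snippets above (file names are just examples):
import gzip

# compress: the gzip file is handled in binary, so text-mode pitfalls don't arise
str_object1 = open('my_log_file', 'rb').read()
f = gzip.open('my_log_file.gz', 'wb')
f.write(str_object1)
f.close()

# decompress
f = gzip.open('my_log_file.gz', 'rb')
str_object2 = f.read()
f.close()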
If you have been using 'rb' and 'wb' in your compression code but not in your decompression code [unlikely?], you are not hosed, you just need to flesh out the above decompression code and go for it.
Note carefully the use of "may", "should", etc in the following untested ideas.
If you have not been using 'rb' and 'wb' in your compression code, the probability that you have hosed yourself is rather high.
If there were any instances of '\x1a' in your original file, any data after the first such is lost -- but in that case it shouldn't fail on decompression (IOW this scenario doesn't match your symptoms).
If a Ctrl-Z was generated by zlib itself, this should cause an early EOF upon attempted decompression, which should of course cause an exception. In this case you may be able to gingerly reverse the process by reading the compressed file in binary mode and then substitute '\r\n' with '\n' [i.e. simulate text mode without the Ctrl-Z -> EOF gimmick]. Decompress the result. Edit Write the result out in TEXT mode. End edit
UPDATE 2 I can reproduce your symptoms -- with ANY level 1 to 9 -- with the following script:
import zlib, sys
fn = sys.argv[1]
level = int(sys.argv[2])
s1 = open(fn).read() # TEXT mode
s2 = zlib.compress(s1, level)
f = open(fn + '-ct', 'w') # TEXT mode
f.write(s2)
f.close()
# try to decompress in text mode
s1 = open(fn + '-ct').read() # TEXT mode
s2 = zlib.decompress(s1) # error -5
f = open(fn + '-dtt', 'w')
f.write(s2)
f.close()
Note: you will need to use a reasonably large text file (I used an 80 KB source file) to ensure that the compressed output will contain a '\x1a'.
I can recover with this script:
import zlib, sys
fn = sys.argv[1]
# (1) reverse the text-mode write
# can't use text-mode read as it will stop at Ctrl-Z
s1 = open(fn, 'rb').read() # BINARY mode
s1 = s1.replace('\r\n', '\n')
# (2) reverse the compression
s2 = zlib.decompress(s1)
# (3) reverse the text mode read
f = open(fn + '-fixed', 'w') # TEXT mode
f.write(s2)
f.close()
NOTE: If there is a '\x1a' aka Ctrl-Z byte in the original file, and the file is read in text mode, that byte and all following bytes will NOT be included in the compressed file, and thus can NOT be recovered. For a text file (e.g. source code), this is no loss at all. For a binary file, you are most likely hosed.
Update 3 [following late revelation that there's an encryption/decryption layer involved in the problem]:
The "Error -5" message indicates that the data that you are trying to decompress has been mangled since it was compressed. If it's not caused by using text mode on the files, suspicion obviously(?) falls on your decryption and encryption wrappers. If you want help, you need to divulge the source of those wrappers. In fact what you should try to do is (like I did) put together a small script that reproduces the problem on more than one input file. Secondly (like I did) see whether you can reverse the process under what conditions. If you want help with the second stage, you need to divulge the problem-reproduction script.
I was looking for:
python -c 'import sys,zlib;sys.stdout.write(zlib.decompress(sys.stdin.read()))'
I wrote it myself, based on the answers to this question.
Okay, sorry I wasn't clear enough. This is win32, Python 2.6.2. I'm afraid I can't find the zlib file, but it's whatever is included in the win32 binary release. And I don't have access to the original data -- I've been compressing my log files, and I'd like to get them back. As far as other software, I've naively tried 7zip, but of course it failed, because it's zlib, not gzip (I couldn't find any software to decompress zlib streams directly). I can't give a carbon copy of the traceback now, but it was (traced back to zlib.decompress(data)) zlib.error: Error: -3. Also, to be clear, these are static files, not streams as I made it sound earlier (so no transmission errors). And I'm afraid, again, I don't have the code, but I know I used zlib.compress(data, 9) (i.e. at the highest compression level -- although, interestingly, it seems that not all the zlib output is 78da as you might expect, since I put it on the highest level) and just zlib.decompress().
Ok sorry about my last post, I didn't have everything. And I can't edit my post because I didn't use OpenID. Anyways, here's some data:
1) Decompression traceback:
Traceback (most recent call last):
File "<my file>", line 5, in <module>
zlib.decompress(data)
zlib.error: Error -5 while decompressing data
2) Compression code:
#here you can assume the data is the data to be compressed/stored
data = encrypt(zlib.compress(data,9)) #a short wrapper around PyCrypto AES encryption
f = open("somefile", 'wb')
f.write(data)
f.close()
3) Decompression code:
f = open("somefile", 'rb')
data = f.read()
f.close()
zlib.decompress(decrypt(data)) #this yields the error in (1)
