does base64 encoding hash the input? - python

I am trying to debug why something is not quite working and observed that b64encode does not seem to work quite as I imagined:
import base64
base64.b64encode( bytes("the cat sat on the mat", "utf-8") )
>> b'dGhlIGNhdCBzYXQgb24gdGhlIG1hdA=='
base64.b64encode( bytes("cat sat on the mat", "utf-8") )
>> b'Y2F0IHNhdCBvbiB0aGUgbWF0'
The second input string has only a small difference at the start, so why is it that the outputs for these strings have virtually no similarity? I would have expected only the start of each output to be a bit different.

Base64 maps 3 input bytes to 4 output bytes.
Since you added 4 input bytes, that means all of the remaining bytes "shifted" into different locations in the output.
Notice the == (padding) on the first example which went away on the second.
Try adding or removing multiples of 3 input bytes:
cat sat on the mat
my cat sat on the mat
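The suggestion above can be tried directly. This is a minimal sketch in Python 3; the helper name b64 is only for illustration. Because "my " is exactly 3 bytes, it encodes to its own 4 output characters and the rest of the output is unchanged.

```python
import base64

def b64(text):
    # Encode a str as UTF-8 bytes, then Base64, and return the result as a str.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

short = b64("cat sat on the mat")
longer = b64("my cat sat on the mat")  # 3 extra input bytes -> 4 extra output characters

print(short)
print(longer)
# The 3-byte prefix occupies its own 4-character group, so the tail is identical.
print(longer[4:] == short)
```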

Base64 is a fully deterministic, reversible transformation, but it does not operate on a per-character basis (as you can also observe from the output length not being a multiple of the input).
Rather, groups of three bytes (24 bits) are encoded at a time by turning them into four 6-bit numbers (hence base 64 = 2^6). If the input length is not a multiple of three, it is padded and indicated as such by putting = characters at the end of the output.
Therefore, common substrings in different inputs will only show up as a common substring in the output if they are aligned on this three-byte frame, and grouped into the same triples.
the cat sat on the mat
dGhlIGNhdCBzYXQgb24gdGhlIG1hdA==
he cat sat on the mat
aGUgY2F0IHNhdCBvbiB0aGUgbWF0
e cat sat on the mat
ZSBjYXQgc2F0IG9uIHRoZSBtYXQ=
 cat sat on the mat (with a leading space)
IGNhdCBzYXQgb24gdGhlIG1hdA==
Observe that if you truncate exactly three characters ("the", leaving the space), the output becomes recognizable again.
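To make the three-byte framing concrete, here is a rough sketch that encodes a single group by hand. The function name encode_triple is hypothetical, and the sketch ignores padding, so it only handles complete 3-byte groups.

```python
import base64
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_triple(chunk):
    # Treat 3 bytes as one 24-bit number and cut it into four 6-bit indexes.
    if len(chunk) != 3:
        raise ValueError("this sketch only handles complete 3-byte groups")
    n = int.from_bytes(chunk, "big")
    return "".join(ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0))

print(encode_triple(b"the"))   # first group of "the cat sat on the mat"
print(encode_triple(b" ca"))   # first group once "the" is removed
print(base64.b64encode(b"the").decode("ascii"))  # library result for comparison
```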

Base56 conversion etc

It seems base58 and base56 conversion treat input data as a single Big Endian number; an unsigned bigint number.
If I'm encoding some integers into shorter strings by trying to use base58 or base56, it seems in some implementations the integer is taken as a native (little endian in my case) representation of bytes and then converted to a string, while in other implementations the number is converted to big endian representation first. It seems the loose specifications of these encodings don't clarify which approach is right. Is there an explicit specification of which to do, or a more widely popular option of the two that I'm not aware of?
I was trying to compare some methods of making a short URL. The source is actually a 10 digit number that's less than 4 billion. In this case I was thinking to make it an unsigned 4 byte integer, possibly Little Endian, and then encode it with a few options (with alphabets):
base64 A…Za…z0…9+/
base64 url-safe A…Za…z0…9-_
Z85 0…9a…zA…Z.-:+=^!/*?&<>()[]{}@%$#
base58 1…9A…HJ…NP…Za…km…z (excluding 0IOl+/ from base64 & reordered)
base56 2…9A…HJ…NP…Za…kmnp…z (excluding 1o from base58)
So like, base16, base32 and base64 make pretty good sense in that they're taking 4, 5 or 6 bits of input data at a time and looking them up in an alphabet index. The latter uses 4 symbols per 3 bytes. Straightforward, and this works for any data.
The other 3 have me finding various implementations that disagree with each other as to the right output. The problem appears to be that no number of bytes maps to a fixed number of lookups in these. E.g. taking 2^1 to 2^100 and getting the remainders for 56, 58 and 85 results in no remainders of 0.
Z85 (ascii85, base85 et al.) approaches this by grabbing 4 bytes at a time and encoding them to 5 symbols and accepting some waste. But there's byte alignment to some degree here (base64 has alignment per 16 symbols, Z85 gets there with 5). But the alphabet is … not great for urls, command-line, nor sgml/xml use.
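The 4-bytes-to-5-symbols step described above could be sketched like this. It is a minimal illustration, not a full Z85 implementation: the function name z85_encode_sketch is made up, and it assumes each 4-byte group is read as a big endian unsigned integer, which is what reproduces the Z85 outputs quoted further down.

```python
# Z85 alphabet: digits, lowercase, uppercase, then 23 punctuation symbols.
Z85_ALPHABET = (
    "0123456789"
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    ".-:+=^!/*?&<>()[]{}@%$#"
)

def z85_encode_sketch(data):
    # Each 4-byte group becomes one 32-bit number written as 5 base-85 digits.
    if len(data) % 4:
        raise ValueError("input length must be a multiple of 4 bytes")
    out = []
    for i in range(0, len(data), 4):
        value = int.from_bytes(data[i:i + 4], "big")
        symbols = []
        for _ in range(5):
            value, rem = divmod(value, 85)
            symbols.append(Z85_ALPHABET[rem])
        out.append("".join(reversed(symbols)))  # most significant digit first
    return "".join(out)

print(z85_encode_sketch((3003295320).to_bytes(4, "big")))
print(z85_encode_sketch((3003295320).to_bytes(4, "little")))
```

Because every group is independent, changing one 4-byte word only changes its own 5 symbols, unlike the bigint approach of base58.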
base58 and base56 seem intent on treating the input bytes like a Big Endian ordered bigint and repeating: % base; lookup; -= % base; /= base on the input bigint. Which… I mean, I think that ends up modifying most of the input for every iteration.
For my input that's not a huge performance concern though.
Because we shouldn't treat the input as string data (the output would be longer than the 10 digit decimal input, and what's the point in that), does anyone know of any indication of which kind of processing results in something canonical for base56 or base58?
1. Take the Little Endian 4 byte word of the 10 digit number (<4*10^9), read that byte sequence as a Big Endian number (which is a different number), and convert that by repeating the steps.
2. Represent the 10 digit number (<4*10^9) in 4 bytes Big Endian before converting it by repeating the steps.
I'm leaning towards going the route of the 2nd way.
For example given the number: 3003295320
The little endian representation is 58 a6 02 b3
The big endian representation is b3 02 a6 58.
base64 gives:
>>> base64.b64encode(int.to_bytes(3003295320,4,'little'))
b'WKYCsw=='
>>> base64.b64encode(int.to_bytes(3003295320,4,'big'))
b'swKmWA=='
>>> base64.b64encode('3003295320'.encode('ascii'))
b'MzAwMzI5NTMyMA==' # Definitely not using this
Z85 gives:
>>> encode(int.to_bytes(3003295320,4,'little'))
b'sF=ea'
>>> encode(int.to_bytes(3003295320,4,'big'))
b'VJv1a'
>>> encode('003003295320'.encode('ascii')) # padding to 4 byte boundary
b'fFCppfF+EAh8v0w' # Definitely not using this
base58 gives:
>>> base58.b58encode(int.to_bytes(3003295320,4,'little'))
b'3GRfwp'
>>> base58.b58encode(int.to_bytes(3003295320,4,'big'))
b'5aPg4o'
>>> base58.b58encode('3003295320')
b'3soMTaEYSLkS4w' # Still not using this
base56 gives:
>>> b56encode(int.to_bytes(3003295320,4,'little'))
b'4HSgyr'
>>> b56encode(int.to_bytes(3003295320,4,'big'))
b'6bQh5q'
>>> b56encode('3003295320')
b'4uqNUbFZTMmT5y' # Longer than 10 digits so...
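For reference, the repeated divide-and-look-up steps can be sketched as follows. The function name b58encode_sketch is hypothetical. It assumes the Bitcoin-style convention, in which the input bytes are read as one big endian unsigned integer and each leading zero byte is written as the first alphabet symbol. Under that assumption it reproduces the base58 outputs shown above.

```python
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58encode_sketch(data):
    # Read the whole input as a single big endian unsigned integer.
    n = int.from_bytes(data, "big")
    digits = []
    while n > 0:
        n, rem = divmod(n, 58)  # the remainder is the next least significant digit
        digits.append(B58_ALPHABET[rem])
    # Leading zero bytes would vanish in the integer, so they are kept explicitly.
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return B58_ALPHABET[0] * leading_zeros + "".join(reversed(digits))

number = 3003295320
print(b58encode_sketch(number.to_bytes(4, "big")))
print(b58encode_sketch(number.to_bytes(4, "little")))
```

With this convention the encoder never sees the integer itself, only bytes, so the byte order chosen when serializing the number is the caller's decision. Serializing big endian makes the encoded string the plain base-58 representation of the number.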

Problems parsing binary data

From a simulation tool I get a binary file containing some measurement points. What I need to do is: parse the measurement values and store them in a list.
According to the documentation of the tool, the data structure of the file looks like this:
First 16 bytes are always the same:
Bytes 0 - 7     char[8]          Header
Byte 8          unsigned char    Version
Byte 9          unsigned char    Byte-order (0 for little endian)
Bytes 10 - 11   unsigned short   Record size
Bytes 12 - 15   char[4]          Reserved
The quantities are following: (for example one double and one float):
Bytes 16 - 23   double           Value of quantity one
Bytes 24 - 27   float            Value of quantity two
Bytes 28 - 35   double           Next value of quantity one
Bytes 36 - 39   float            Next value of quantity two
I also know that the encoding is little endian.
In my use case there are two quantities, but both of them are floats.
My code so far looks like this:
def parse(self, filePath):
    infoFilePath = filePath + '.info'
    quantityList = self.getQuantityList(infoFilePath)
    blockSize = 0
    for quantity in quantityList:
        blockSize += quantity.bytes
    with open(filePath, 'r') as ergFile:
        # read the first 16 bytes, as they are not needed now
        ergFile.read(16)
        # now read the rest of the file block wise
        block = ergFile.read(blockSize)
        while len(block) == blockSize:
            for q in quantityList:
                q.values.append(np.fromstring(block[:q.bytes], q.dataType)[0])
                block = block[q.bytes:]
            block = ergFile.read(blockSize)
    return quantityList
QuantityList comes from a previous function and contains the quantity structure. Each quantity has a name, dataType, lenOfBytes called bytes and a prepared list for the values called values.
So in my usecase there are two quantities with:
dataType = "<f"
bytes = 4
values=[]
After the parse function has finished I plot the first quantity with matplotlib. As you can see from the attached images, something went wrong during the parsing.
[Image: my parsed values]
[Image: the reference]
But I am not able to find my mistake.
I was able to solve my problem this morning.
The solution couldn't be any easier.
I changed
...
with open(ergFilePath, 'r') as ergFile:
...
to:
...
with open(ergFilePath, 'rb') as ergFile:
...
Notice the change from 'r' to 'rb' as mode.
The Python documentation made things clear for me:
Thus, when opening a binary file, you should append 'b' to the mode
value to open the file in binary mode, which will improve portability.
(Appending 'b' is useful even on systems that don’t treat binary and
text files differently, where it serves as documentation.)
So the final parsed values look like this:
[Image: final values]
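As a rough alternative to slicing each block by hand, the layout from the question could be read with the struct module. This is only a sketch in Python 3: the name parse_records, the "<ff" record format (two little endian floats, as in the use case above) and the sample bytes are assumptions for illustration. The stream must come from a file opened in 'rb' mode.

```python
import io
import struct

# Header layout from the question: char[8], unsigned char, unsigned char,
# unsigned short, char[4]. '<' selects little endian without padding.
HEADER = struct.Struct("<8sBBH4s")

def parse_records(stream, record_format="<ff"):
    # Read the 16-byte header, then fixed-size records until the data runs out.
    magic, version, byte_order, record_size, _reserved = HEADER.unpack(
        stream.read(HEADER.size))
    record = struct.Struct(record_format)
    rows = []
    while True:
        block = stream.read(record.size)
        if len(block) < record.size:  # end of file or a truncated record
            break
        rows.append(record.unpack(block))
    return (magic, version, byte_order, record_size), rows

# Made-up sample data standing in for a real file opened with open(path, 'rb').
sample = HEADER.pack(b"SIMDATA\x00", 1, 0, 8, b"\x00" * 4)
sample += struct.pack("<ff", 1.5, -2.0) + struct.pack("<ff", 0.25, 3.0)
header, rows = parse_records(io.BytesIO(sample))
print(header)
print(rows)
```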

Python read a binary file and decode

I am quite new to Python and I need to solve this simple problem. There are already several similar questions, but I still cannot solve it.
I need to read a binary file, which is composed of several blocks of bytes. For example, the header is composed of 6 bytes, and I would like to extract those 6 bytes and transform them into a sequence of binary digits like 000100110 011001, for example.
navatt_dir='C:/PROCESSING/navatt_read/'
navatt_filename='OSPS_FRMT_NAVATT____20130621T100954_00296_caseB.bin'
navatt_path=navatt_dir+navatt_filename
navatt_file=open(navatt_path, 'rb')
header=list(navatt_file.read(6))
print header
As the result of the list I get the following:
%run C:/PROCESSING/navatt_read/navat_read.py
['\t', 'i', '\xc0', '\x00', '\x00', 't']
which is not what I want.
I would also like to read a particular value in the binary file, knowing its position and length, without reading the whole file. Is it possible?
Thanks
ByteArray
A bytearray is a mutable sequence of bytes (Integers where 0 ≤ x ≤ 255). You can construct a bytearray from a string (If it is not a byte-string, you will have to provide encoding), an iterable of byte-sized integers, or an object with a buffer interface. You can of course just build it manually as well.
An example using a byte-string:
string = b'DFH'
b = bytearray(string)
# Print it as a string
print b
# Prints the individual bytes, showing you that it's just a list of ints
print [i for i in b]
# Lets add one to the D
b[0] += 1
# And print the string again to see the result!
print b
The result:
DFH
[68, 70, 72]
EFH
This is the type you want if you want raw byte manipulation. If what you want is to read 4 bytes as a 32-bit int, you can use the unpack function of the struct module, but I usually just shift them together myself from a bytearray.
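A small sketch of both approaches in Python 3, assuming the 4 bytes hold a little endian unsigned 32-bit value (the sample bytes are made up):

```python
import struct

raw = bytearray(b"\x78\x56\x34\x12")  # made-up sample: 0x12345678 stored little endian

# struct: '<I' is a little endian unsigned 32-bit integer.
via_struct = struct.unpack("<I", raw)[0]

# Manual shifting: the byte at index 0 is the least significant one.
via_shift = raw[0] | raw[1] << 8 | raw[2] << 16 | raw[3] << 24

# Python 3 also offers int.from_bytes for the same job.
via_int = int.from_bytes(raw, "little")

print(hex(via_struct), hex(via_shift), hex(via_int))
```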
Printing the header in binary
What you seem to want is to take the string you have, convert it to a bytearray, and print them as a string in base 2/binary.
So here is a short example for how to write the header out (I read random data from a file named "dump"):
with open('dump', 'rb') as f:
    header = f.read(6)
b = bytearray(header)
print ' '.join([bin(i)[2:].zfill(8) for i in b])
After converting it to a bytearray, I call bin() on every single one, which gives back a string with the binary representation we need, in the format of "0b1010". I don't want the "0b", so I slice it off with [2:]. Then, I use the string method zfill, which allows me to have the required amount of 0's prepended for the string to be 8 long (which is the amount of bits we need), as bin will not show any unneeded zeroes.
If you're new to the language, the last line might look quite mean. It uses a list comprehension to make a list of all the binary strings we want to print, and then joins them into the final string with spaces between the elements.
A less pythonic/convoluted variant of the last line would be:
result = []
for byte in b:
    string = bin(byte)[2:] # Make a binary string and slice off the first two characters ("0b")
    result.append(string.zfill(8)) # Append a 0-padded version to the results list
# Join the list into a space-separated string and print it!
print ' '.join(result)
I hope this helps!
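On the second part of the question (reading a value at a known position without reading the whole file), seeking to the offset before reading is one option. A minimal sketch in Python 3 follows; the helper names read_at and to_bit_string are made up, and an in-memory stream stands in for a file opened with open(path, 'rb').

```python
import io

def read_at(stream, offset, length):
    # Jump straight to the offset; nothing before it is read.
    stream.seek(offset)
    return stream.read(length)

def to_bit_string(data):
    # In Python 3, iterating over bytes yields ints, so format() can pad to 8 bits.
    return " ".join(format(byte, "08b") for byte in data)

# The 6 header bytes printed in the question, followed by made-up payload bytes.
stream = io.BytesIO(b"\ti\xc0\x00\x00t" + b"payload")
print(to_bit_string(read_at(stream, 0, 6)))
print(read_at(stream, 6, 3))
```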

Python Struct, size changed by alignment.

Here's the hex code I am trying to unpack.
b'ABCDFGHa\x00a\x00a\x00a\x00a\x00\x00\x00\x00\x00\x00\x01' (it's not supposed to make any sense)
labels = unpack('BBBBBBBHHHHH5sB', msg)
struct.error: unpack requires a bytes argument of length 24
From what I counted, both of those are length = 23, both the format in my unpack function and the length of the hex values. I don't understand.
Thanks in advance
Most processors access data faster when the data is on natural boundaries, meaning data of size 2 should be on even addresses, data of size 4 should be accessed on addresses divisible by four, etc.
struct by default maintains this alignment. Since your structure starts out with 7 'B', a padding byte is added to align the next 'H' on an even address. To prevent this in Python, precede your format string with '='.
Example:
>>> import struct
>>> struct.calcsize('BBB')
3
>>> struct.calcsize('BBBH')
6
>>> struct.calcsize('=BBBH')
5
I think H is enforcing 2-byte alignment after your 7 B
Aha, the alignment info is at the top of http://docs.python.org/library/struct.html, not down by the definition of the format characters.
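Applied to the message from the question, a sketch could look like this. It uses '<' rather than '=': both disable padding, but '<' also fixes the byte order to little endian, so the result does not depend on the machine. That the 'H' fields are little endian is an assumption about the data.

```python
import struct

msg = b'ABCDFGHa\x00a\x00a\x00a\x00a\x00\x00\x00\x00\x00\x00\x01'

# Repeat counts are shorthand: 7B5H5sB means BBBBBBBHHHHH5sB.
packed_format = '<7B5H5sB'

print(len(msg))                        # size of the data
print(struct.calcsize(packed_format))  # size the format expects without padding
labels = struct.unpack(packed_format, msg)
print(labels)
```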

editing a wav files using python

Between each word in the wav file I have full silence (I checked with Hex workshop and silence is represented with 0's).
How can I cut the non-silence sound?
I'm programming using python.
Thanks!
Python has a wave module. You can use it to open a wav file for reading and use the readframes(1) method to walk through the file frame by frame.
import wave
w = wave.open('beeps.wav', 'r')
for i in range(w.getnframes()):
    frame = w.readframes(1)
The frame returned will be a byte string with hex values in it. If the file is stereo the result will look something like this (4 bytes):
'\xe2\xff\xe2\xff'
If it's mono, it will have half the data (2 bytes):
'\xe2\xff'
Each channel is 2 bytes long because the audio is 16 bit. If it is 8 bit, each channel will only be one byte. You can use the getsampwidth() method to determine this. Also, getnchannels() will tell you whether it's mono or stereo.
You can loop over these bytes to see if they all equal zero, meaning both channels are silent. In the following example I use the ord() function to convert the '\xe2' hex values to integers.
import wave
w = wave.open('beeps.wav', 'r')
for i in range(w.getnframes()):
    ### read 1 frame and the position will be updated ###
    frame = w.readframes(1)
    all_zero = True
    for j in range(len(frame)):
        # check if amplitude is greater than 0
        if ord(frame[j]) > 0:
            all_zero = False
            break
    if all_zero:
        # perform your cut here
        print 'silence found at frame %s' % w.tell()
        print 'silence found at second %s' % (w.tell()/w.getframerate())
It is worth noting that a single frame of silence doesn't necessarily denote empty space, since the amplitude may cross the 0 mark in normal audio. Therefore, it is recommended that a certain number of consecutive frames at 0 be observed before deciding that the region is, in fact, silent.
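The "certain number of frames" idea can be sketched as a search for runs of consecutive quiet samples. This is an illustration in Python 3, not the answer's code: the function name find_silent_runs and its parameters are made up, and it works on samples that have already been decoded to integers.

```python
def find_silent_runs(samples, min_frames, threshold=0):
    # Return (start_index, length) for every run of at least min_frames
    # consecutive samples whose absolute value is <= threshold.
    runs = []
    start = None
    for index, sample in enumerate(samples):
        if abs(sample) <= threshold:
            if start is None:
                start = index  # a new quiet run begins here
        else:
            if start is not None and index - start >= min_frames:
                runs.append((start, index - start))
            start = None
    # A quiet run may extend to the very end of the data.
    if start is not None and len(samples) - start >= min_frames:
        runs.append((start, len(samples) - start))
    return runs

print(find_silent_runs([3, 0, 0, 0, 4, 0, 5, 0, 0], min_frames=2))
```

The single zero between 4 and 5 is ignored because it is shorter than min_frames, which is the zero-crossing case mentioned above.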
I have been doing some research on this topic for a project I'm working on and I came across a few problems with the solution provided, namely the method for determining silence is incorrect. A "more correct" implementation would be:
import struct
import wave
wave_file = wave.open("sound_file.wav", "r")
for i in range(wave_file.getnframes()):
    # read a single frame and advance to next frame
    current_frame = wave_file.readframes(1)
    # check for silence
    silent = True
    # wave frame samples are stored in little endian**
    # this example works for a single channel 16-bit per sample encoding
    unpacked_signed_value = struct.unpack("<h", current_frame) # *
    if abs(unpacked_signed_value[0]) > 500:
        silent = False
    if silent:
        print "Frame %s is silent." % wave_file.tell()
    else:
        print "Frame %s is not silent." % wave_file.tell()
References and Useful Links
*Struct Unpacking will be useful here: https://docs.python.org/2/library/struct.html
**A good reference I found explaining the format of wave files for dealing with different size bit-encodings and multiple channels is: http://www.piclist.com/techref/io/serial/midi/wave.html
Using the built-in ord() function in Python on the first element of the string object returned by the readframes(x) method will not work correctly.
Another key point is that multiple channel audio is interleaved and thus a little extra logic is needed for dealing with channels. Again, the link above goes into detail about this.
Hopefully this helps someone in the future.
Here are some of the more important points from the link, and what I found helpful.
Data Organization
All data is stored in 8-bit bytes, arranged in Intel 80x86 (ie, little endian) format. The bytes of multiple-byte values are stored with the low-order (ie, least significant) bytes first. Data bits are as follows (ie, shown with bit numbers on top):
         7  6  5  4  3  2  1  0
       +-----------------------+
char:  | lsb               msb |
       +-----------------------+
         7  6  5  4  3  2  1  0 15 14 13 12 11 10  9  8
       +-----------------------+-----------------------+
short: | lsb     byte 0        |        byte 1     msb |
       +-----------------------+-----------------------+
         7  6  5  4  3  2  1  0 15 14 13 12 11 10  9  8 23 22 21 20 19 18 17 16 31 30 29 28 27 26 25 24
       +-----------------------+-----------------------+-----------------------+-----------------------+
long:  | lsb     byte 0        |        byte 1         |        byte 2         |        byte 3     msb |
       +-----------------------+-----------------------+-----------------------+-----------------------+
Interleaving
For multichannel sounds (for example, a stereo waveform), single sample points from each channel are interleaved. For example, assume a stereo (ie, 2 channel) waveform. Instead of storing all of the sample points for the left channel first, and then storing all of the sample points for the right channel next, you "mix" the two channels' sample points together. You would store the first sample point of the left channel. Next, you would store the first sample point of the right channel. Next, you would store the second sample point of the left channel. Next, you would store the second sample point of the right channel, and so on, alternating between storing the next sample point of each channel. This is what is meant by interleaved data; you store the next sample point of each of the channels in turn, so that the sample points that are meant to be "played" (ie, sent to a DAC) simultaneously are stored contiguously.
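De-interleaving as described above could be sketched like this in Python 3. The function name split_channels is hypothetical, and the sketch assumes 16-bit signed little endian samples, as in the earlier examples.

```python
import struct

def split_channels(raw, nchannels):
    # Split interleaved 16-bit little endian PCM bytes into one list per channel.
    count = len(raw) // 2
    samples = struct.unpack("<%dh" % count, raw[:count * 2])
    # Sample i belongs to channel i % nchannels, so a strided slice separates them.
    return [list(samples[channel::nchannels]) for channel in range(nchannels)]

# Two stereo frames: (left=1, right=-1), then (left=2, right=-2).
raw = struct.pack("<4h", 1, -1, 2, -2)
print(split_channels(raw, 2))

# The stereo frame quoted earlier, '\xe2\xff\xe2\xff', decodes to -30 on both channels.
print(split_channels(b"\xe2\xff\xe2\xff", 2))
```

The raw bytes would normally come from readframes(n) and the channel count from getnchannels().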
See also How to edit raw PCM audio data without an audio library?
I have no experience with this, but have a look at the wave module present in the standard library. That may do what you want. Otherwise you'll have to read the file as a byte stream and cut out sequences of 0-bytes (but you cannot just cut out all 0-bytes, as that would invalidate the file...)
You might want to try using sox, a command-line sound processing tool. It has many modes, one of them is silence:
silence: Removes silence from the beginning, middle, or end of a sound file. Silence is anything below a specified threshold.
It supports multiple sound formats and it's quite fast, so parsing large files shouldn't be a problem.
To remove silence from the middle of a file, specify a below_periods that is negative. This value is then treated as a positive value and is also used to indicate the effect should restart processing as specified by the above_periods, making it suitable for removing periods of silence in the middle of the sound file.
I haven't found any Python binding for libsox, though, but you can use it the way you use any command line program from Python (or you can rewrite it, using the sox sources for guidance).
You will need to come up with some threshold value of a minimum number of consecutive zeros before you cut them. Otherwise you'll be removing perfectly valid zeros from the middle of normal audio data. You can iterate through the wave file, copying any non-zero values, and buffering up zero values. When you're buffering zeroes and eventually come across the next non-zero, if the buffer has fewer samples than the threshold, copy them over, otherwise discard it.
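The buffering approach just described might look roughly like this in Python 3. Both function names are made up. The file wrapper is a sketch that assumes mono 16-bit PCM and reads the whole file into memory, so it is only suitable for small files.

```python
import struct
import wave

def drop_long_silences(samples, min_run, threshold=0):
    # Copy samples, discarding quiet runs of at least min_run samples.
    kept, pending = [], []
    for sample in samples:
        if abs(sample) <= threshold:
            pending.append(sample)  # buffer quiet samples until we know the run length
            continue
        if len(pending) < min_run:
            kept.extend(pending)    # short run: ordinary audio, keep it
        pending = []
        kept.append(sample)
    if len(pending) < min_run:      # handle a quiet run at the very end
        kept.extend(pending)
    return kept

def strip_silence_wav(src_path, dst_path, min_run, threshold=0):
    # Assumption: mono, 16-bit PCM. Other layouts need de-interleaving first.
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.nchannels != 1 or params.sampwidth != 2:
            raise ValueError("this sketch handles mono 16-bit PCM only")
        raw = src.readframes(params.nframes)
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    kept = drop_long_silences(samples, min_run, threshold)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(struct.pack("<%dh" % len(kept), *kept))

print(drop_long_silences([5, 0, 0, 7, 0, 0, 0, 0, 9, 0], min_run=3))
```

Setting threshold above 0 treats low-level noise as silence, in the spirit of the 500 cutoff used in the earlier answer.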
Python is not a great tool for this sort of task though. :(
