Python: convert hex bytestream to "int16"

I'm working with incoming audio from Watson Text to Speech. I want to play the sound as soon as the data arrives in Python over a websocket from Node.js.
This is an example of the data I'm sending over the websocket:
<Buffer e3 f8 28 f9 fa f9 5d fb 6c fc a6 fd 12 ff b3 00 b8 02 93 04 42 06 5b 07 e4 07 af 08 18 0a 95 0b 01 0d a2 0e a4 10 d7 12 f4 12 84 12 39 13 b0 12 3b 13 ... >
So the data arrives as a hex bytestream and I try to convert it to something that Sounddevice can read and play. (See the documentation: The types 'float32', 'int32', 'int16', 'int8' and 'uint8' can be used for all streams and functions.) But how can I convert this?
I already tried something, but when I run my code I only hear noise, nothing recognizable.
Here are the relevant parts of my code:
def onMessage(self, payload, isBinary):
    a = payload.encode('hex')
    queue.put(a)
After I receive the bytestream and convert it to hex, I try to send it to Sounddevice:
def stream_audio():
    with sd.OutputStream(channels=1, samplerate=24000, dtype='int16', callback=callback):
        sd.sleep(int(20 * 1000))
def callback(outdata, frames, time, status):
    global reststuff, i, string
    LENGTH = frames
    while len(reststuff) < LENGTH:
        a = queue.get()
        reststuff += a
    returnstring = reststuff[:LENGTH]
    reststuff = reststuff[LENGTH:]
    for char in returnstring:
        i += 1
        string += char
        if i % 2 == 0:
            print string
            outdata[:] = int(string, 16)
            string = ""

Look at your stream of data:
e3 f8 28 f9 fa f9 5d fb 6c fc a6 fd 12 ff b3 00
b8 02 93 04 42 06 5b 07 e4 07 af 08 18 0a 95 0b
01 0d a2 0e a4 10 d7 12 f4 12 84 12 39 13 b0 12
3b 13
Notice that in every pair of bytes the second one starts with e/f/0/1, which means the 16-bit value is near zero (in two's complement).
So the second byte of each pair is the most significant byte, which means your stream is little-endian!
You should take that into account in your conversion.
If I had more data I would have tested this, but it is worth a few milliseconds to try!
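A minimal sketch of the conversion this answer points to, assuming the websocket payload is the raw byte string (not a hex-encoded copy of it) and that the samples are signed 16-bit little-endian. The helper name `bytes_to_int16` is hypothetical, and the sketch uses Python 3 with only the standard library:

```python
import struct

def bytes_to_int16(payload):
    """Interpret raw bytes as little-endian signed 16-bit samples.

    A trailing odd byte (an incomplete sample) is dropped.
    """
    usable = len(payload) - (len(payload) % 2)
    count = usable // 2
    # "<" selects little-endian, "h" is a signed 16-bit integer.
    return list(struct.unpack("<%dh" % count, payload[:usable]))

# The first bytes of the buffer shown in the question.
sample = bytes.fromhex("e3f828f9faf9b300")
print(bytes_to_int16(sample))  # [-1821, -1752, -1542, 179]
```

The small magnitudes are consistent with quiet audio near the start of a clip. If NumPy is available, `numpy.frombuffer(payload, dtype='<i2')` should give the same values as an array, which is closer to what a Sounddevice callback expects.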

Related

Difference in result while reading same file with node and python

I have been trying to read the contents of the genesis.block given in this file of the Node SDK in Hyperledger Fabric using Python. However, whenever I try to read the file with Python by using
data = open("twoorgs.genesis.block").read()
The value of the data variable is as follows:
>>> data
'\n'
With nodejs using fs.readFileSync() I obtain an instance of Buffer() for the same file.
var data = fs.readFileSync('./twoorgs.genesis.block');
The result is
> data
<Buffer 0a 22 1a 20 49 63 63 ac 9c 9f 3e 48 2c 2c 6b 48 2b 1f 8b 18 6f a9 db ac 45 07 29 ee c0 bf ac 34 99 9e c2 56 12 e1 84 01 0a dd 84 01 0a d9 84 01 0a 79 ... >
How can I read this file successfully using Python?
Your file has a 1a byte in it. This is Ctrl-Z, which is treated as an end-of-file marker in text mode on Windows.
So try binary mode like:
data = open("twoorgs.genesis.block", 'rb').read()
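A small sketch of the same idea in Python 3, assuming nothing about the real genesis block beyond the first bytes shown in the question. It writes a throwaway file containing the 0x1a byte and reads it back in binary mode; the helper name `read_all_bytes` is hypothetical:

```python
import os
import tempfile

def read_all_bytes(path):
    """Return the complete file contents as bytes, with no text-mode translation."""
    with open(path, "rb") as fh:
        return fh.read()

# A few bytes from the start of the Buffer in the question, including 0x1a.
sample = bytes([0x0A, 0x22, 0x1A, 0x20, 0x49, 0x63])

fd, tmp_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as fh:
        fh.write(sample)
    data = read_all_bytes(tmp_path)
finally:
    os.remove(tmp_path)

print(len(data), data.hex())
```

In binary mode the bytes after 0x1a are returned as well, so the length matches what Node's `fs.readFileSync()` reports for the same content.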

Speed up python code

I have some text file in following format (network traffic collected by tcpdump):
1505372009.023944 00:1e:4c:72:b8:ae > 00:23:f8:93:c1:af, ethertype IPv4 (0x0800), length 97: (tos 0x0, ttl 64, id 5134, offset 0, flags [DF], proto TCP (6), length 83)
192.168.1.53.36062 > 74.125.143.139.443: Flags [P.], cksum 0x67fd (correct), seq 1255996541:1255996572, ack 1577943820, win 384, options [nop,nop,TS val 356377 ecr 746170020], length 31
0x0000: 0023 f893 c1af 001e 4c72 b8ae 0800 4500 .#......Lr....E.
0x0010: 0053 140e 4000 4006 8ab1 c0a8 0135 4a7d .S..#.#......5J}
0x0020: 8f8b 8cde 01bb 4adc fc7d 5e0d 830c 8018 ......J..}^.....
0x0030: 0180 67fd 0000 0101 080a 0005 7019 2c79 ..g.........p.,y
0x0040: a6a4 1503 0300 1a00 0000 0000 0000 04d1 ................
0x0050: c300 9119 6946 698c 67ac 47a9 368a 1748 ....iFi.g.G.6..H
0x0060: 1c .
and want to change it to:
1505372009.023944
000000: 00 23 f8 93 c1 af 00 1e 4c 72 b8 ae 08 00 45 00 .#......Lr....E.
000010: 00 53 14 0e 40 00 40 06 8a b1 c0 a8 01 35 4a 7d .S..#.#......5J}
000020: 8f 8b 8c de 01 bb 4a dc fc 7d 5e 0d 83 0c 80 18 ......J..}^.....
000030: 01 80 67 fd 00 00 01 01 08 0a 00 05 70 19 2c 79 ..g.........p.,y
000040: a6 a4 15 03 03 00 1a 00 00 00 00 00 00 00 04 d1 ................
000050: c3 00 91 19 69 46 69 8c 67 ac 47 a9 36 8a 17 48 ....iFi.g.G.6..H
000060: 1c .
Here is what I have done:
import re
regexp_time =re.compile("\d\d\d\d\d\d\d\d\d\d.\d\d\d\d\d\d+")
regexp_hex = re.compile("(\t0x\d+:\s+)([0-9a-f ]+)+ ")
with open ('../Traffic/traffic1.txt') as input,open ('../Traffic/txt2.txt','w') as output:
    for line in input:
        if regexp_time.match(line):
            output.write ("%s\n" % (line.split()[0]))
        elif regexp_hex.match(line):
            words = re.split(r'\s{2,}', line)
            bytes=""
            for byte in words[1].split():
                if len(byte) == 4:
                    bytes += "%s%s %s%s "%(byte[0],byte[1],byte[2],byte[3])
                elif len(byte) == 2:
                    bytes += "%s%s "%(byte[0],byte[1])
            output.write ("%s %s %s \n" % (words[0].replace("0x","00"),"{:<47}".format (bytes),words[2].replace("\n","")))
input.close()
output.close()
Could someone help me speed this up?
Edit
Here is the new version of the code, based on @Austin's answer. It really speeds things up.
with open ('../Traffic/traffic1.txt') as input,open ('../Traffic/txt1.txt','w') as output:
    for line in input:
        if line[0].isdigit():
            output.write (line[:16])
            output.write ('\n')
        elif line.startswith("\t0x"):#(Since there is line which is not hex and not start with timestamp I should check this as well)
            offset = line[:10] # " 0x0000: "
            words = line[10:51] # "0023 f893 c1af 001e 4c72 b8ae 0800 4500 "
            chars = line[51:] # " .#......Lr....E."
            line = [offset.replace('x', '0', 1)]
            for a,b,c,d,space in zip (words[0::5],words[1::5],words[2::5],words[3::5],words[4::5]):
                line.append(a)
                line.append(b)
                line.append(space)
                line.append(c)
                line.append(d)
                line.append(space)
            line.append (chars)
            output.write (''.join (line))
input.close()
output.close()
Here is the result:
1505372009.02394
000000: 00 23 f8 93 c1 af 00 1e 4c 72 b8 ae 08 00 45 00 .#......Lr....E.
000010: 00 53 14 0e 40 00 40 06 8a b1 c0 a8 01 35 4a 7d .S..#.#......5J}
000020: 8f 8b 8c de 01 bb 4a dc fc 7d 5e 0d 83 0c 80 18 ......J..}^.....
000030: 01 80 67 fd 00 00 01 01 08 0a 00 05 70 19 2c 79 ..g.........p.,y
000040: a6 a4 15 03 03 00 1a 00 00 00 00 00 00 00 04 d1 ................
000050: c3 00 91 19 69 46 69 8c 67 ac 47 a9 36 8a 17 48 ....iFi.g.G.6..H
000060: 1c .
You haven't specified anything else about your file format, including what if any lines appear between blocks of packet data. So I'm going to assume that you just have paragraphs like the one you show, jammed together.
The best way to speed up something like this is to reduce the extra operations. You have a bunch! For example:
You use a regex to match the "start" line.
You use a split to extract the timestamp from the start line.
You use a %-format operator to write the timestamp out.
You use a different regex to match a "hex" line.
You use more than one split to parse the hex line.
You use various formatting operators to output the hex line.
If you're going to use regular expression matching, then I think you should just do one match. Create an alternate pattern (like a|b) that describes both lines. Use match.lastgroup or .lastindex to decide what got matched.
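A minimal sketch of the single-pattern idea, assuming hex lines start with a tab followed by `0x` (as the asker's own regex suggests). The group names `stamp` and `hexline` and the helper `classify` are made up for illustration:

```python
import re

# One compiled pattern with two named alternatives.
LINE_RE = re.compile(r"(?P<stamp>\d+\.\d+)|(?P<hexline>\t0x[0-9a-f]+:)")

def classify(line):
    """Return the name of the alternative that matched, or None."""
    match = LINE_RE.match(line)
    return match.lastgroup if match else None

print(classify("1505372009.023944 00:1e:4c:72:b8:ae > 00:23:f8:93:c1:af"))  # stamp
print(classify("\t0x0000:  0023 f893 c1af"))                                # hexline
print(classify("    192.168.1.53.36062 > 74.125.143.139.443"))              # None
```

Only one alternative can match at a time, so `lastgroup` identifies the line type after a single call to `match`.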
But your lines are so different that I don't think a regex is needed. Basically, you can decide what sort of line you have by looking at the very first character:
if line[0].isdigit():
    pass  # This is a timestamp line
else:
    pass  # This is a hex line
For timestamp processing, all you want to do is print out the 17 characters at the start of the line: 10 digits, a dot, and 6 more digits. So do that:
if line[0].isdigit():
    output.write(line[:17] + '\n')
For hex line processing, you want to make two kinds of changes: you want to replace the 'x' in the hex offset with a zero. That's easy:
hexline = line.replace('x', '0', 1) # Note: 1 replacement only!
Then, you want to insert spaces between the groups of 4 hex digits, and pad the short lines so the character display appears in the same column.
This is a place where regular expression replacement might help you. There's a limited number of occurrences, but it may be that the overhead of the CPython interpreter costs more than the setup and teardown for a regex replacement. You should probably do some profiling on this.
That said, you can split the line into three parts. It's important to capture the trailing space on the middle part, though:
offset = line[:13] # " 0x0000: "
words = line[13:53] # "0023 f893 c1af 001e 4c72 b8ae 0800 4500 "
chars = line[53:] # " .#......Lr....E."
You already know how to replace the 'x' in the offset, and there's nothing to be done to the chars portion of the line. So we'll leave those alone. The remaining task is to spread out the characters in the words string. You can do that in various ways, but it seems easy to process the characters in chunks of 5 (4 hex digits plus a trailing space).
We can do this because we captured the trailing space on the words part. If not, you might have to use itertools.zip_longest(..., fillvalue=''), but it's probably easier just to grab one more character.
With that done, you can do:
for a,b,c,d,space in zip(words[0::5], words[1::5], words[2::5], words[3::5], words[4::5]):
    output.write(a + b + space + c + d + space)  # write() takes a single string
Alternatively, instead of making all those calls you could accumulate the characters in a buffer and then write the buffer one time. Something like:
line = [offset]
for ...:
    line.extend((a, b, space, c, d, space))  # extend() takes one iterable
line.append(chars)
line.append('\n')
output.write(''.join(line))
That's fairly straightforward, but like I said, it may not perform quite as well as a regular-expression replacement. That would be due to the regex code running as "C" rather than python bytecode. So you should compare it against a pattern replacement like:
words = re.sub(r'(..)(..) ', r'\1 \2 ', words)  # raw replacement string so \1 and \2 are backreferences
Note that I didn't require hex digits, in order to cause any trailing "padding" spaces on the last line of a paragraph to expand in proportion.
Again, please check the performance against the zip version above!
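As a rough sketch of that comparison, the two approaches can be put side by side and checked for identical output before timing them. The function names `spread_zip` and `spread_regex` are hypothetical, and the sample strings assume the 40-character words field (8 groups of 4 hex digits plus a space) described above:

```python
import re
import timeit

def spread_zip(words):
    """Split each 4-digit group into two 2-digit groups using strided slices."""
    parts = []
    for a, b, c, d, space in zip(words[0::5], words[1::5], words[2::5],
                                 words[3::5], words[4::5]):
        parts.extend((a, b, space, c, d, space))
    return "".join(parts)

def spread_regex(words):
    """Same transformation with a single regex substitution."""
    return re.sub(r"(..)(..) ", r"\1 \2 ", words)

full = "0023 f893 c1af 001e 4c72 b8ae 0800 4500 "
short = "1c" + " " * 38  # last line of a packet, padded to the same width

print(repr(spread_zip(full)))
print(spread_zip(full) == spread_regex(full))
print(spread_zip(short) == spread_regex(short))

# Timings depend on the machine, so no expected numbers are shown here.
print(timeit.timeit(lambda: spread_zip(full), number=10000))
print(timeit.timeit(lambda: spread_regex(full), number=10000))
```

Both functions expand a 40-character field to 48 characters, so the character column stays aligned on short lines too.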

Identify the contents of a file through a program in python [duplicate]

This question already has answers here:
Tools to help reverse engineer binary file formats
(9 answers)
Closed 6 years ago.
I have a file here. To me it appears it is a binary file. This is raw file and I believe that it has the stock information in OHLCV (Open, High, Low, Close, Volume). Besides it may also have some text.
One of the entries that I could possibly have for OHLCV is
464.95, 468.3, 460, 465.65, 3957854
This is the code that I have tried. I don't fully understand ASCII and Unicode.
input_file = "00063181.dat" # tata motors
with open(input_file, "rb") as fh:
    buf = fh.read()
output_l = list(map(int , buf))
print (output_l)
My Doubt: How do I decode this file and make sense out of it? Is there any way for me to read this file through a program written in python and separate the text from int/float? I am using Python 3 and Win 10 64 bit.
You're looking to reverse engineer the structure of a binary file using Python. Since the file is binary, that may prove difficult. You're going to need to examine the contents of the file and use your best intuition to infer the structure. The first thing you're going to want is a way to display each of the bytes of the file in a way that will help you understand their meaning.
Fortunately, someone has already written a tool to do this, hexdump. Install that package using pip.
The function you need from that package is hexdump, so let's import the package and get help on the function.
>>> import hexdump
>>> help(hexdump.hexdump)
Help on function hexdump in module hexdump:
hexdump(data, result='print')
Transform binary data to the hex dump text format:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[x] data argument as a binary string
[x] data argument as a file like object
Returns result depending on the `result` argument:
'print' - prints line by line
'return' - returns single string
'generator' - returns generator that produces lines
Now you can start to explore the contents of your file. Use the slice operator to do it in chunks. For example, to render the contents of the first 1KB of your file:
>>> hexdump.hexdump(buf[:1024])
00000000: C3 8E C2 8F 22 13 C2 AA 66 2A 22 47 C3 94 C3 AA ...."...f*"G....
00000010: C3 89 C3 A0 C3 B1 C3 91 6A C2 A4 C3 BF 3C C2 AA ........j....<..
00000020: C2 91 73 C3 85 46 57 47 C2 88 C3 99 C2 B6 3E 2D ..s..FWG......>-
00000030: C3 BA 69 10 C2 93 C3 94 38 C3 81 7A 6A 43 30 7C ..i.....8..zjC0|
00000040: C3 BB C2 AA 01 2D C2 97 C3 83 C3 88 64 14 C3 9C .....-......d...
00000050: C2 AB C2 AA C3 A2 74 C2 85 5D C3 97 4E 64 68 C3 ......t..]..Ndh.
...
000003C0: 42 C2 8F 06 7F 12 33 7F 79 1E 2C 2A 0F C3 92 36 B.....3.y.,*...6
000003D0: C3 A6 C2 96 C2 93 C2 8B 43 C2 9F 4C C2 95 48 24 ........C..L..H$
000003E0: C2 B3 C2 82 26 C3 88 C3 BD C3 96 12 1E 5E 18 2E ....&........^..
000003F0: 37 C3 A7 C2 87 C3 AE 00 4F 3F C2 9C C3 A8 1C C2 7.......O?......
Hexdump has a nice property of rendering the byte position, the hex code, and then (if possible) the printable form of the character on the right.
Hopefully some of your text values will be visible there and that will give some clue as to how to reverse engineer your file.
Once you've started to determine how your file is structured, you can use the various string operators to manipulate your data. For example, if you find that your file is split into sections by the null byte (b'\x00'), you can get those sections thus:
>>> sections = buf.split(b'\x00')
There are a lot of things that you're likely to have to learn as you dig deeper, like character encodings, number encodings (including little-endian for integers and floating-point encoding for floating point numbers). You'll want to find some way to externally validate your results.
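One way to test a guess about number encodings is to pack a value you expect to be in the file (such as the 464.95 from the question) in several candidate encodings and search for the resulting bytes. This is only a sketch under that assumption; the helper name `find_known_value` is hypothetical, and a miss does not prove the value is absent, since the file may be compressed or use a scaled integer instead:

```python
import struct

def find_known_value(buf, value):
    """Return {encoding name: first offset} for each encoding of value found in buf."""
    candidates = {
        "float32 little-endian": struct.pack("<f", value),
        "float32 big-endian": struct.pack(">f", value),
        "float64 little-endian": struct.pack("<d", value),
        "float64 big-endian": struct.pack(">d", value),
    }
    hits = {}
    for name, pattern in candidates.items():
        position = buf.find(pattern)
        if position != -1:
            hits[name] = position
    return hits

# Synthetic buffer standing in for the real file contents.
demo = b"\x00\x01" + struct.pack("<f", 464.95) + b"junk"
print(find_known_value(demo, 464.95))  # {'float32 little-endian': 2}
```

A hit at a plausible offset gives you both the encoding and a position from which to look for the neighbouring fields.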
Best of luck.

Python Crypto - example script explanation

I've been trying to work out why the code below is failing to pad the IV to 16 bytes. I've taken a look at the Crypto docs but I am none the wiser. I have found a few examples online, but I don’t see the difference between the failing code below and the working examples (in Ruby). Any help would be appreciated.
import sys
from Crypto.Cipher import AES
from base64 import b64decode
key = """
4e 99 06 e8 fc b6 6c c9 fa f4 93 10 62 0f fe e8
f4 96 e8 06 cc 05 79 90 20 9b 09 a4 33 b6 6c 1b
"""
key.replace(" ","").replace("\n","").decode('hex')
password1 = "j1Uyj3Vx8TY9LtLZil2uAuZkFQA/4latT76ZwgdHdhw"
password1 += "=" * ((4 - len(password1) % 4) % 4)
password = b64decode(password1)
o = AES.new(key, AES.MODE_CBC).decrypt(password)
print o[:-ord(o[-1])].decode('utf16')
from Crypto.Cipher import AES
import base64
def rpad(s, fill='=', multiple=8):
    """
    Pad s with the fill char so the length of the string
    is a multiple of `multiple` (default 8).
    """
    return s + fill * (-len(s) % multiple)
key = """
4e 99 06 e8 fc b6 6c c9 fa f4 93 10 62 0f fe e8
f4 96 e8 06 cc 05 79 90 20 9b 09 a4 33 b6 6c 1b
"""
key = key.replace(" ","").replace("\n","").decode('hex')
mode = AES.MODE_CBC
iv = "\x00"*16
enc = AES.new(key, mode, iv)
password = "j1Uyj3Vx8TY9LtLZil2uAuZkFQA/4latT76ZwgdHdhw"
decoded = base64.b64decode(rpad(password, multiple=4))
o = enc.decrypt(decoded)
print(o[:-ord(o[-1])].decode('utf16'))
prints
Local*P4ssword!
As Jon Clements pointed out, key.replace(...) returns a string. You need to reassign that string
to key, or else the replacement is done for naught.
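The two non-cryptographic steps (reassigning the cleaned key and padding the base64 text) can be checked on their own. This sketch uses Python 3 and the standard library only, so `bytes.fromhex` stands in for the Python 2 `.decode('hex')` call; the AES step is left out, and the variable names are illustrative:

```python
import base64

def rpad(s, fill="=", multiple=4):
    """Pad s with fill so that its length is a multiple of `multiple`."""
    return s + fill * (-len(s) % multiple)

key_text = """
4e 99 06 e8 fc b6 6c c9 fa f4 93 10 62 0f fe e8
f4 96 e8 06 cc 05 79 90 20 9b 09 a4 33 b6 6c 1b
"""

# str.replace returns a new string; without assignment key_text is unchanged.
key_text.replace(" ", "").replace("\n", "")
still_has_spaces = " " in key_text

# Reassign (here to a new name) to keep the cleaned-up value.
key = bytes.fromhex(key_text.replace(" ", "").replace("\n", ""))

password = "j1Uyj3Vx8TY9LtLZil2uAuZkFQA/4latT76ZwgdHdhw"
decoded = base64.b64decode(rpad(password))

print(still_has_spaces, len(key), len(decoded))
```

A 32-byte key and a ciphertext whose length is a multiple of 16 are what AES in CBC mode needs as input.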
Not sure if this is your issue, but you may have a typo. :) password2 should read password1 I think.

Python binary data reading

A urllib2 request receives binary response as below:
00 00 00 01 00 04 41 4D 54 44 00 00 00 00 02 41
97 33 33 41 99 5C 29 41 90 3D 71 41 91 D7 0A 47
0F C6 14 00 00 01 16 6A E0 68 80 41 93 B4 05 41
97 1E B8 41 90 7A E1 41 96 8F 57 46 E6 2E 80 00
00 01 16 7A 53 7C 80 FF FF
Its structure is:
DATA, TYPE, DESCRIPTION
00 00 00 01, 4 bytes, Symbol Count =1
00 04, 2 bytes, Symbol Length = 4
41 4D 54 44, 4 bytes, Symbol = AMTD
00, 1 byte, Error code = 0 (OK)
00 00 00 02, 4 bytes, Bar Count = 2
FIRST BAR
41 97 33 33, 4 bytes, Close = 18.90
41 99 5C 29, 4 bytes, High = 19.17
41 90 3D 71, 4 bytes, Low = 18.03
41 91 D7 0A, 4 bytes, Open = 18.23
47 0F C6 14, 4 bytes, Volume = 3,680,608
00 00 01 16 6A E0 68 80, 8 bytes, Timestamp = November 23,2007
SECOND BAR
41 93 B4 05, 4 bytes, Close = 18.4629
41 97 1E B8, 4 bytes, High = 18.89
41 90 7A E1, 4 bytes, Low = 18.06
41 96 8F 57, 4 bytes, Open = 18.82
46 E6 2E 80, 4 bytes, Volume = 2,946,325
00 00 01 16 7A 53 7C 80, 8 bytes, Timestamp = November 26,2007
TERMINATOR
FF FF, 2 bytes,
How to read binary data like this?
Thanks in advance.
Update:
I tried struct module on first 6 bytes with following code:
struct.unpack('ih', response.read(6))
(16777216, 1024)
But it should output (1, 4). I took a look at the manual but have no clue what is wrong.
So here's my best shot at interpreting the data you're giving...:
import datetime
import struct
class Printable(object):
    specials = ()
    def __str__(self):
        resultlines = []
        for pair in self.__dict__.items():
            if pair[0] in self.specials: continue
            resultlines.append('%10s %s' % pair)
        return '\n'.join(resultlines)
head_fmt = '>IH4sBI'  # the symbol is 4 bytes (41 4D 54 44) and the bar count is a 4-byte integer
head_struct = struct.Struct(head_fmt)
class Header(Printable):
    specials = ('bars',)
    def __init__(self, symbol_count, symbol_length,
                 symbol, error_code, bar_count):
        self.__dict__.update(locals())
        self.bars = []
        del self.self
bar_fmt = '>5fQ'
bar_struct = struct.Struct(bar_fmt)
class Bar(Printable):
    specials = ('header',)
    def __init__(self, header, close, high, low,
                 open, volume, timestamp):
        self.__dict__.update(locals())
        self.header.bars.append(self)
        del self.self
        self.timestamp /= 1000.0
        self.timestamp = datetime.date.fromtimestamp(self.timestamp)
def showdata(data):
    terminator = '\xff' * 2
    assert data[-2:] == terminator
    head_data = head_struct.unpack(data[:head_struct.size])
    try:
        assert head_data[4] * bar_struct.size + head_struct.size == \
               len(data) - len(terminator)
    except AssertionError:
        print 'data length is %d' % len(data)
        print 'head struct size is %d' % head_struct.size
        print 'bar struct size is %d' % bar_struct.size
        print 'number of bars is %d' % head_data[4]
        print 'head data:', head_data
        print 'terminator:', terminator
        print 'so, something is wrong, since',
        print head_data[4] * bar_struct.size + head_struct.size, '!=',
        print len(data) - len(terminator)
        raise
    head = Header(*head_data)
    for i in range(head.bar_count):
        bar_substr = data[head_struct.size + i * bar_struct.size:
                          head_struct.size + (i+1) * bar_struct.size]
        bar_data = bar_struct.unpack(bar_substr)
        Bar(head, *bar_data)
    assert len(head.bars) == head.bar_count
    print head
    for i, x in enumerate(head.bars):
        print 'Bar #%s' % i
        print x
datas = '''
00 00 00 01 00 04 41 4D 54 44 00 00 00 00 02 41
97 33 33 41 99 5C 29 41 90 3D 71 41 91 D7 0A 47
0F C6 14 00 00 01 16 6A E0 68 80 41 93 B4 05 41
97 1E B8 41 90 7A E1 41 96 8F 57 46 E6 2E 80 00
00 01 16 7A 53 7C 80 FF FF
'''
data = ''.join(chr(int(x, 16)) for x in datas.split())
showdata(data)
this emits:
symbol_count 1
bar_count 2
symbol AMTD
error_code 0
symbol_length 4
Bar #0
volume 36806.078125
timestamp 2007-11-22
high 19.1700000763
low 18.0300006866
close 18.8999996185
open 18.2299995422
Bar #1
volume 29463.25
timestamp 2007-11-25
high 18.8899993896
low 18.0599994659
close 18.4629001617
open 18.8199901581
...which seems to be pretty close to what you want, net of some output formatting details. Hope this helps!-)
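For readers on Python 3, here is a compact sketch of the same parsing, assuming the layout in the question's table and a single symbol per response as in the sample. It reads the symbol using the length field rather than a fixed width, and converts the millisecond timestamp in UTC; the function name `parse_response` and the dictionary keys are illustrative:

```python
import struct
from datetime import datetime, timezone

HEX_DUMP = """
00 00 00 01 00 04 41 4D 54 44 00 00 00 00 02 41
97 33 33 41 99 5C 29 41 90 3D 71 41 91 D7 0A 47
0F C6 14 00 00 01 16 6A E0 68 80 41 93 B4 05 41
97 1E B8 41 90 7A E1 41 96 8F 57 46 E6 2E 80 00
00 01 16 7A 53 7C 80 FF FF
"""

# Big-endian: close, high, low, open, volume (floats), timestamp in ms (uint64).
BAR = struct.Struct(">5fQ")

def parse_response(data):
    symbol_count, symbol_length = struct.unpack_from(">IH", data, 0)
    offset = 6
    symbol = data[offset:offset + symbol_length].decode("ascii")
    offset += symbol_length
    error_code, bar_count = struct.unpack_from(">BI", data, offset)
    offset += 5
    bars = []
    for _ in range(bar_count):
        close, high, low, open_, volume, ts_ms = BAR.unpack_from(data, offset)
        offset += BAR.size
        day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).date()
        bars.append({"close": close, "high": high, "low": low,
                     "open": open_, "volume": volume, "date": day.isoformat()})
    if data[offset:offset + 2] != b"\xff\xff":
        raise ValueError("missing FF FF terminator")
    return {"symbol": symbol, "error_code": error_code, "bars": bars}

data = bytes.fromhex("".join(HEX_DUMP.split()))
result = parse_response(data)
print(result["symbol"], [bar["date"] for bar in result["bars"]])
```

Using UTC avoids the one-day shift that a local-time conversion can introduce, which is one possible reason the output above shows November 22 and 25 instead of 23 and 26.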
>>> data
'\x00\x00\x00\x01\x00\x04AMTD\x00\x00\x00\x00\x02A\x9733A\x99\\)A\x90=qA\x91\xd7\nG\x0f\xc6\x14\x00\x00\x01\x16j\xe0h\x80A\x93\xb4\x05A\x97\x1e\xb8A\x90z\xe1A\x96\x8fWF\xe6.\x80\x00\x00\x01\x16zS|\x80\xff\xff'
>>> from struct import unpack, calcsize
>>> scount, slength = unpack("!IH", data[:6])
>>> assert scount == 1
>>> symbol, error_code = unpack("!%dsb" % slength, data[6:6+slength+1])
>>> assert error_code == 0
>>> symbol
'AMTD'
>>> bar_count = unpack("!I", data[6+slength+1:6+slength+1+4])
>>> bar_count
(2,)
>>> bar_format = "!5fQ"
>>> from collections import namedtuple
>>> Bar = namedtuple("Bar", "Close High Low Open Volume Timestamp")
>>> b = Bar(*unpack(bar_format, data[6+slength+1+4:6+slength+1+4+calcsize(bar_format)]))
>>> b
Bar(Close=18.899999618530273, High=19.170000076293945, Low=18.030000686645508, Open=18.229999542236328, Volume=36806.078125, Timestamp=1195794000000L)
>>> import time
>>> time.ctime(b.Timestamp//1000)
'Fri Nov 23 08:00:00 2007'
>>> int(b.Volume*100 + 0.5)
3680608
>>> struct.unpack('ih', response.read(6))
(16777216, 1024)
You are unpacking big-endian data on a little-endian machine. Try this instead:
>>> struct.unpack('!IH', response.read(6))
(1L, 4)
This tells unpack to treat the data as network order (big-endian). Also, counts and lengths cannot be negative, so you should use the unsigned variants in your format string.
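The effect of the byte-order prefix can be seen directly on the six header bytes. This is a small Python 3 sketch; `<` is written explicitly here to reproduce what the native order gives on a little-endian machine:

```python
import struct

raw = b"\x00\x00\x00\x01\x00\x04"

little = struct.unpack("<ih", raw)   # bytes read least-significant first
big = struct.unpack(">IH", raw)      # big-endian, unsigned
network = struct.unpack("!IH", raw)  # "!" is network order, same as big-endian

print(little)   # (16777216, 1024)
print(big)      # (1, 4)
print(network)  # (1, 4)
```

The value 16777216 is 0x01000000, which is the same four bytes as 1 read in the opposite order.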
Take a look at the struct.unpack in the struct module.
Use pack/unpack functions from "struct" package. More info here http://docs.python.org/library/struct.html
Bye!
As it was already mentioned, struct is the module you need to use.
Please read its documentation to learn about byte ordering, etc.
In your example you need to do the following (as your data is big-endian and unsigned):
>>> import struct
>>> x = '\x00\x00\x00\x01\x00\x04'
>>> struct.unpack('>IH', x)
(1, 4)
