Unknown event in MIDI file - python

As I've posted about before, I am writing a MIDI parser in Python. I am running into a problem where my parser gets stuck because it tries to read an event called 2a, but no such event exists. Below is an excerpt from the MIDI file in question:
5d7f 00b5 5d7f 00b6 5d7f 00b1 5d00 00b9
5d00 8356 9923 7f00 2a44 0192 367f 0091
237f 0099 4640 0092 2f7c 0099 3f53 0b3f
I have parsed the file by hand, and I am getting stuck in the same spot as my parser! The MIDI file plays, so I know it's valid, but I'm certain that I am reading the events wrong.

The Standard MIDI Files 1.0 specification says:
Running status is used: status bytes of MIDI channel messages may be omitted if the preceding event is a MIDI channel message with the same status. The first event in each MTrk chunk must specify status. Delta-time is not considered an event itself: it is an integral part of the syntax for an MTrk event. Notice that running status occurs across delta-times.
Your excerpt would be decoded as follows:
delta  <----- event ----->
time   status parameters
-----  ------ ----------
  ...         5d 7f
   00  b5     5d 7f
   00  b6     5d 7f
   00  b1     5d 00
   00  b9     5d 00
83 56  99     23 7f
   00         2a 44
   01  92     36 7f
   00  91     23 7f
   00  99     46 40
   00  92     2f 7c
   00  99     3f 53
   0b         3f ...
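So the 2a is not an event type at all: it is the first data byte of another Note On that reuses the running status 99. As a rough, self-contained sketch (not your parser's structure, and handling channel messages only, since the excerpt contains nothing else), a reader that honours running status could look like this; the leading 5d 7f and the trailing 0b 3f are left out because they belong to events that extend beyond the excerpt:

data = bytes.fromhex(
    "00b55d7f00b65d7f00b15d0000b95d00"
    "835699237f002a440192367f0091237f"
    "0099464000922f7c00993f53"
)

def read_varlen(buf, i):
    """Read a MIDI variable-length quantity starting at index i."""
    value = 0
    while True:
        byte = buf[i]
        i += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, i

i = 0
running_status = None
while i < len(data):
    delta, i = read_varlen(data, i)
    if data[i] & 0x80:
        running_status = data[i]   # a new status byte
        i += 1
    # otherwise: running status -- reuse the previous status byte
    status = running_status
    # Channel messages Cn (program change) and Dn (channel pressure) carry
    # one data byte; all other channel messages carry two.
    n_params = 1 if (status & 0xF0) in (0xC0, 0xD0) else 2
    params = data[i:i + n_params]
    i += n_params
    print("delta=%-6d status=%02x params=%s"
          % (delta, status, " ".join("%02x" % b for b in params)))

This prints the same rows as the table above, with the 2a 44 event reported under status 99.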

Related

How to write n bytes to a binary file in python 2.7

I am trying to use f.write(struct.pack()) to write n bytes to a binary file, but I am not quite sure how to do that. Any example or sample would be helpful.
You don't really explain your exact problem, what you tried, or which error messages you encountered.
The solution should look something like:
with open("filename", "wb") as fout:
    fout.write(struct.pack(format, data, ...))
If you explain what data exactly you want to dump, I can elaborate on the solution.
If your data is just a hex string, then you do not need struct; you can just use decode.
Please refer to the SO question "hexadecimal string to byte array in python".
Example for Python 2.7:
hex_str = "414243444500ff"
bytestring = hex_str.decode("hex")
with open("filename", "wb") as fout:
    fout.write(bytestring)
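For what it's worth, the Python 3 equivalent of the same idea uses bytes.fromhex, since str.decode("hex") no longer exists there (a small sketch, not part of the original answer):

hex_str = "414243444500ff"
bytestring = bytes.fromhex(hex_str)   # b'ABCDE\x00\xff'
with open("filename", "wb") as fout:
    fout.write(bytestring)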
The following worked for me (the "48s" format pads the string with null bytes to exactly 48 bytes):
reserved = "Reserved_48_Bytes"
f.write(struct.pack("48s", reserved))
Output:
hexdump -C output.bin
00000030 52 65 73 65 72 76 65 64 5f 34 38 5f 42 79 74 65 |Reserved_48_Byte|
00000040 73 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |s...............|
00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

Internet checksum -- Adding hex numbers together for checksum

I came across the following example of creating an Internet Checksum:
Take the example IP header 45 00 00 54 41 e0 40 00 40 01 00 00 0a 00 00 04 0a 00 00 05:
Adding the fields together yields the two’s complement sum 01 1b 3e.
Then, to convert it to one's complement, the carry-over bits are added back to the low 16 bits: 1b 3e + 01 = 1b 3f.
Finally, the one's complement of the sum is taken, resulting in the checksum value e4c0.
I was wondering how the IP header is added together to get 01 1b 3e?
Split your IP header into 16-bit parts.
45 00
00 54
41 e0
40 00
40 01
00 00
0a 00
00 04
0a 00
00 05
The sum is 01 1b 3e. You might want to look at how packet header checksums are calculated here: https://en.m.wikipedia.org/wiki/IPv4_header_checksum.
The IP header is added together, with carry, as four-digit hexadecimal (16-bit) numbers,
i.e. the first three numbers added are 0x4500 + 0x0054 + 0x41e0 + ...
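Expressed in Python, the whole procedure (sum the 16-bit words, fold the carries back in, then take the one's complement) looks roughly like this; a small sketch with the example header hard-coded:

import struct

def internet_checksum(data):
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length data with a zero byte
    # Sum the data as big-endian 16-bit words.
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    # Fold any carry bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    # One's complement of the folded sum.
    return ~total & 0xFFFF

header = bytes.fromhex("4500005441e04000400100000a0000040a000005")
print(hex(internet_checksum(header)))        # 0xe4c0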

pyside2-uic null bytes in output

I'm trying to convert Qt .ui files made using Qt Designer with pyside2-uic but the output starts with 2 garbage bytes then every other byte is a null.
Here's the start of the output:
FF FE 23 00 20 00 2D 00 2A 00 2D 00 20 00 63 00 6F 00 64 00 69 00 6E 00 67 00 3A 00 20 00 75 00 74 00 66 00 2D 00 38 00 20 00 2D 00 2A 00 2D 00 0D 00 0A 00 0D 00 0A 00 23 00 20 00 46 00 6F 00
If I remove the first 2 bytes and all the nulls, then it works as expected.
I'm using Python 3.7 and the newest version of pyside2, is there any way to get pyside2-uic to output a valid file without having to run it through another script to pull out all the garbage?
FYI, the issue seems to be UTF-8 encoding (when using -o) vs. UTF-16 LE (when the output is redirected in PowerShell).
This also matches the dump above: the leading FF FE is the UTF-16 LE byte order mark, and every character is followed by a 00 byte (16-bit vs. 8-bit code units).
This bug(?) only occurs when pyside2-uic is run in PowerShell and the output is redirected to a file.
If you are using PowerShell, use the -o option to specify an output file instead of redirecting with >; both methods work fine from a normal command prompt:
pyside2-uic mainwindow.ui -o MainWindow.py
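If you already have a file that was mangled by redirection, the UTF-16 diagnosis above also suggests a simple recovery: read it back as UTF-16 and rewrite it as UTF-8 (a sketch with assumed file names):

# Assumed file names; adjust to your own paths.
with open("MainWindow_redirected.py", encoding="utf-16") as fin:
    text = fin.read()
with open("MainWindow.py", "w", encoding="utf-8") as fout:
    fout.write(text)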

Speed up python code

I have a text file in the following format (network traffic collected by tcpdump):
1505372009.023944 00:1e:4c:72:b8:ae > 00:23:f8:93:c1:af, ethertype IPv4 (0x0800), length 97: (tos 0x0, ttl 64, id 5134, offset 0, flags [DF], proto TCP (6), length 83)
192.168.1.53.36062 > 74.125.143.139.443: Flags [P.], cksum 0x67fd (correct), seq 1255996541:1255996572, ack 1577943820, win 384, options [nop,nop,TS val 356377 ecr 746170020], length 31
0x0000: 0023 f893 c1af 001e 4c72 b8ae 0800 4500 .#......Lr....E.
0x0010: 0053 140e 4000 4006 8ab1 c0a8 0135 4a7d .S..@.@......5J}
0x0020: 8f8b 8cde 01bb 4adc fc7d 5e0d 830c 8018 ......J..}^.....
0x0030: 0180 67fd 0000 0101 080a 0005 7019 2c79 ..g.........p.,y
0x0040: a6a4 1503 0300 1a00 0000 0000 0000 04d1 ................
0x0050: c300 9119 6946 698c 67ac 47a9 368a 1748 ....iFi.g.G.6..H
0x0060: 1c .
and want to change it to:
1505372009.023944
000000: 00 23 f8 93 c1 af 00 1e 4c 72 b8 ae 08 00 45 00 .#......Lr....E.
000010: 00 53 14 0e 40 00 40 06 8a b1 c0 a8 01 35 4a 7d .S..@.@......5J}
000020: 8f 8b 8c de 01 bb 4a dc fc 7d 5e 0d 83 0c 80 18 ......J..}^.....
000030: 01 80 67 fd 00 00 01 01 08 0a 00 05 70 19 2c 79 ..g.........p.,y
000040: a6 a4 15 03 03 00 1a 00 00 00 00 00 00 00 04 d1 ................
000050: c3 00 91 19 69 46 69 8c 67 ac 47 a9 36 8a 17 48 ....iFi.g.G.6..H
000060: 1c .
Here is what I have done:
import re

regexp_time = re.compile(r"\d\d\d\d\d\d\d\d\d\d.\d\d\d\d\d\d+")
regexp_hex = re.compile(r"(\t0x\d+:\s+)([0-9a-f ]+)+ ")

with open('../Traffic/traffic1.txt') as input, open('../Traffic/txt2.txt', 'w') as output:
    for line in input:
        if regexp_time.match(line):
            output.write("%s\n" % (line.split()[0]))
        elif regexp_hex.match(line):
            words = re.split(r'\s{2,}', line)
            bytes = ""
            for byte in words[1].split():
                if len(byte) == 4:
                    bytes += "%s%s %s%s " % (byte[0], byte[1], byte[2], byte[3])
                elif len(byte) == 2:
                    bytes += "%s%s " % (byte[0], byte[1])
            output.write("%s %s %s \n" % (words[0].replace("0x", "00"), "{:<47}".format(bytes), words[2].replace("\n", "")))
input.close()
output.close()
Could someone help me speed this up?
Edit
Here is the new version of the code, based on @Austin's answer. It really sped up the code.
with open('../Traffic/traffic1.txt') as input, open('../Traffic/txt1.txt', 'w') as output:
    for line in input:
        if line[0].isdigit():
            output.write(line[:16])
            output.write('\n')
        elif line.startswith("\t0x"):  # some lines are neither hex nor timestamp lines, so check this as well
            offset = line[:10]   # "\t0x0000:  "
            words = line[10:51]  # "0023 f893 c1af 001e 4c72 b8ae 0800 4500 "
            chars = line[51:]    # " .#......Lr....E."
            line = [offset.replace('x', '0', 1)]
            for a, b, c, d, space in zip(words[0::5], words[1::5], words[2::5], words[3::5], words[4::5]):
                line.append(a)
                line.append(b)
                line.append(space)
                line.append(c)
                line.append(d)
                line.append(space)
            line.append(chars)
            output.write(''.join(line))
input.close()
output.close()
Here is the result:
1505372009.02394
000000: 00 23 f8 93 c1 af 00 1e 4c 72 b8 ae 08 00 45 00 .#......Lr....E.
000010: 00 53 14 0e 40 00 40 06 8a b1 c0 a8 01 35 4a 7d .S..@.@......5J}
000020: 8f 8b 8c de 01 bb 4a dc fc 7d 5e 0d 83 0c 80 18 ......J..}^.....
000030: 01 80 67 fd 00 00 01 01 08 0a 00 05 70 19 2c 79 ..g.........p.,y
000040: a6 a4 15 03 03 00 1a 00 00 00 00 00 00 00 04 d1 ................
000050: c3 00 91 19 69 46 69 8c 67 ac 47 a9 36 8a 17 48 ....iFi.g.G.6..H
000060: 1c .
You haven't specified anything else about your file format, including what lines, if any, appear between blocks of packet data. So I'm going to assume that you just have paragraphs like the one you show, jammed together.
The best way to speed up something like this is to reduce the extra operations. You have a bunch! For example:
You use a regex to match the "start" line.
You use a split to extract the timestamp from the start line.
You use a %-format operator to write the timestamp out.
You use a different regex to match a "hex" line.
You use more than one split to parse the hex line.
You use various formatting operators to output the hex line.
If you're going to use regular expression matching, then I think you should just do one match. Create an alternate pattern (like a|b) that describes both lines. Use match.lastgroup or .lastindex to decide what got matched.
But your lines are so different that I don't think a regex is needed. Basically, you can decide what sort of line you have by looking at the very first character:
if line[0].isdigit():
    # This is a timestamp line
    ...
else:
    # This is a hex line
    ...
For timestamp processing, all you want to do is print out the 17 characters at the start of the line: 10 digits, a dot, and 6 more digits. So do that:
if line[0].isdigit():
    output.write(line[:17] + '\n')
For hex line processing, you want to make two kinds of changes. First, replace the 'x' in the hex offset with a zero. That's easy:
hexline = line.replace('x', '0', 1) # Note: 1 replacement only!
Then, you want to insert spaces between the groups of 4 hex digits, and pad the short lines so the character display appears in the same column.
This is a place where regular expression replacement might help you. There's a limited number of occurrences, but it may be that the overhead of the CPython interpreter costs more than the setup and teardown for a regex replacement. You should probably do some profiling on this.
That said, you can split the line into three parts. It's important to capture the trailing space on the middle part, though:
offset = line[:13] # " 0x0000: "
words = line[13:53] # "0023 f893 c1af 001e 4c72 b8ae 0800 4500 "
chars = line[53:] # " .#......Lr....E."
You already know how to replace the 'x' in the offset, and there's nothing to be done to the chars portion of the line. So we'll leave those alone. The remaining task is to spread out the characters in the words string. You can do that in various ways, but it seems easy to process the characters in chunks of 5 (4 hex digits plus a trailing space).
We can do this because we captured the trailing space on the words part. If not, you might have to use itertools.zip_longest(..., fillvalue=''), but it's probably easier just to grab one more character.
With that done, you can do:
for a, b, c, d, space in zip(words[0::5], words[1::5], words[2::5], words[3::5], words[4::5]):
    output.write(a + b + space + c + d + space)
Alternatively, instead of making all those calls you could accumulate the characters in a buffer and then write the buffer one time. Something like:
line = [offset]
for ...:
    line.extend((a, b, space, c, d, space))
line.append(chars)
line.append('\n')
output.write(''.join(line))
That's fairly straightforward, but like I said, it may not perform quite as well as a regular-expression replacement. That would be due to the regex code running as "C" rather than Python bytecode. So you should compare it against a pattern replacement like:
words = re.sub(r'(..)(..) ', r'\1 \2 ', words)
Note that I didn't require hex digits, in order to cause any trailing "padding" spaces on the last line of a paragraph to expand in proportion.
Again, please check the performance against the zip version above!
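If it helps, one way to do that comparison is a quick timeit run on a single hex line (a rough sketch; the sample line and iteration count are arbitrary, and real timings will depend on your data):

import re
import timeit

# Hypothetical sample hex line, taken from the tcpdump output above.
line = "\t0x0000:  0023 f893 c1af 001e 4c72 b8ae 0800 4500  .#......Lr....E.\n"
words = line[10:51]
pattern = re.compile(r'(..)(..) ')

def expand_zip():
    out = []
    for a, b, c, d, space in zip(words[0::5], words[1::5], words[2::5], words[3::5], words[4::5]):
        out.extend((a, b, space, c, d, space))
    return ''.join(out)

def expand_re():
    return pattern.sub(r'\1 \2 ', words)

print("zip:", timeit.timeit(expand_zip, number=100_000))
print("re :", timeit.timeit(expand_re, number=100_000))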

Parse WAV file header

I am writing a program to parse a WAV file header and print the information to the screen. Before writing the program I am doing some research:
hexdump -n 48 sound_file_8000hz.wav
00000000 52 49 46 46 bc af 01 00 57 41 56 45 66 6d 74 20 |RIFF....WAVEfmt |
00000010 10 00 00 00 01 00 01 00 >40 1f 00 00< 40 1f 00 00 |........@...@...|
00000020 01 00 08 00 64 61 74 61 98 af 01 00 81 80 81 80 |....data........|
hexdump -n 48 sound_file_44100hz.wav
00000000 52 49 46 46 c4 ea 1a 00 57 41 56 45 66 6d 74 20 |RIFF....WAVEfmt |
00000010 10 00 00 00 01 00 02 00 >44 ac 00 00< 10 b1 02 00 |........D.......|
00000020 04 00 10 00 64 61 74 61 a0 ea 1a 00 00 00 00 00 |....data........|
The part between > and < in both files is the sample rate.
How does "40 1f 00 00" translate to 8000 Hz and "44 ac 00 00" to 44100 Hz? Information like the number of channels and the audio format can be read directly from the dump. I found a Python script called WavHeader that parses the sample rate correctly in both files. This is the core of the script:
bufHeader = fileIn.read(38)

# Verify that the correct identifiers are present
if (bufHeader[0:4] != "RIFF") or \
   (bufHeader[12:16] != "fmt "):
    logging.debug("Input file not a standard WAV file")
    return
# endif

stHeaderFields = {'ChunkSize' : 0, 'Format' : '',
                  'Subchunk1Size' : 0, 'AudioFormat' : 0,
                  'NumChannels' : 0, 'SampleRate' : 0,
                  'ByteRate' : 0, 'BlockAlign' : 0,
                  'BitsPerSample' : 0, 'Filename': ''}

# Parse fields
stHeaderFields['ChunkSize'] = struct.unpack('<L', bufHeader[4:8])[0]
stHeaderFields['Format'] = bufHeader[8:12]
stHeaderFields['Subchunk1Size'] = struct.unpack('<L', bufHeader[16:20])[0]
stHeaderFields['AudioFormat'] = struct.unpack('<H', bufHeader[20:22])[0]
stHeaderFields['NumChannels'] = struct.unpack('<H', bufHeader[22:24])[0]
stHeaderFields['SampleRate'] = struct.unpack('<L', bufHeader[24:28])[0]
stHeaderFields['ByteRate'] = struct.unpack('<L', bufHeader[28:32])[0]
stHeaderFields['BlockAlign'] = struct.unpack('<H', bufHeader[32:34])[0]
stHeaderFields['BitsPerSample'] = struct.unpack('<H', bufHeader[34:36])[0]
I do not understand how this can extract the correct sample rates when I cannot do so using hexdump.
I am using information about the WAV file format from this page:
https://ccrma.stanford.edu/courses/422/projects/WaveFormat/
The "40 1F 00 00" bytes equate to an integer whose hexadecimal value is 00001F40 (remember that the integers are stored in a WAVE file in the little endian format). A value of 00001F40 in hexadecimal equates to a decimal value of 8000.
Similarly, the "44 AC 00 00" bytes equate to an integer whose hexadecimal value is 0000AC44. A value of 0000AC44 in hexadecimal equates to a decimal value of 44100.
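You can verify this with struct, which is exactly what the WavHeader script's unpack calls do (using bytes.fromhex here just for the sake of the example):

import struct

# '<L' = little-endian unsigned 32-bit integer, as in the WavHeader script.
print(struct.unpack('<L', bytes.fromhex("401f0000"))[0])  # 8000
print(struct.unpack('<L', bytes.fromhex("44ac0000"))[0])  # 44100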
They're little-endian.
>>> 0x00001f40
8000
>>> 0x0000ac44
44100
