Reading a WAV file from the TIMIT database in Python

I'm trying to read a WAV file from the TIMIT database in Python, but I get an error:
When I'm using wave:
wave.Error: file does not start with RIFF id
When I'm using scipy:
ValueError: File format b'NIST'... not understood.
and when I'm using librosa, the program gets stuck.
I tried to convert it to wav using sox:
cmd = "sox " + wav_file + " -t wav " + new_wav
subprocess.call(cmd, shell=True)
and it didn't help. I saw an old answer referring to the package scikits.audiolab, but it looks like it is no longer supported.
How can I read these files to get an ndarray of the data?
Thanks

Your file is not a WAV file. Apparently it is a NIST SPHERE file. From the LDC web page: "Many LDC corpora contain speech files in NIST SPHERE format." According to the description of the NIST File Format, the first four characters of the file are NIST. That's what the scipy error is telling you: it doesn't know how to read a file that begins with NIST.
I suspect you'll have to convert the file to WAV if you want to read the file with any of the libraries that you tried. To force the conversion to WAV using the program sph2pipe, use the command option -f wav (or equivalently, -f rif), e.g.
sph2pipe -f wav input.sph output.wav
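If you want to drive the conversion from Python, a minimal sketch could look like the following; it assumes sph2pipe is installed and on your PATH, and input.sph / output.wav are just placeholder names:
import subprocess
from scipy.io import wavfile

# Convert the SPHERE file to a real RIFF/WAV file, then read it with scipy.
subprocess.run(["sph2pipe", "-f", "wav", "input.sph", "output.wav"], check=True)
rate, data = wavfile.read("output.wav")   # data is a numpy ndarray of samples
print(rate, data.shape)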

Issue this from the command line to verify it is a WAV file ... or not:
xxd -b myaudiofile.wav | head
If it is in WAV format, the output will look something like
00000000: 01010010 01001001 01000110 01000110 10111100 10101111 RIFF..
00000006: 00000001 00000000 01010111 01000001 01010110 01000101 ..WAVE
0000000c: 01100110 01101101 01110100 00100000 00010000 00000000 fmt ..
00000012: 00000000 00000000 00000001 00000000 00000001 00000000 ......
00000018: 01000000 00011111 00000000 00000000 01000000 00011111 #...#.
0000001e: 00000000 00000000 00000001 00000000 00001000 00000000 ......
00000024: 01100100 01100001 01110100 01100001 10011000 10101111 data..
0000002a: 00000001 00000000 10000001 10000000 10000001 10000000 ......
00000030: 10000001 10000000 10000001 10000000 10000001 10000000 ......
00000036: 10000001 10000000 10000001 10000000 10000001 10000000 ......
Here is yet another way to display the contents of a binary file like a WAV:
od -A x -t x1z -v audio_util_test_file_custom.wav | head
000000 52 49 46 46 24 80 00 00 57 41 56 45 66 6d 74 20 >RIFF$...WAVEfmt <
000010 10 00 00 00 01 00 01 00 44 ac 00 00 88 58 01 00 >........D....X..<
000020 02 00 10 00 64 61 74 61 00 80 00 00 00 00 78 05 >....data......x.<
000030 ed 0a 5e 10 c6 15 25 1b 77 20 ba 25 eb 2a 08 30 >..^...%.w .%.*.0<
000040 0e 35 fc 39 cf 3e 84 43 1a 48 8e 4c de 50 08 55 >.5.9.>.C.H.L.P.U<
000050 0b 59 e4 5c 91 60 12 64 63 67 85 6a 74 6d 30 70 >.Y.\.`.dcg.jtm0p<
000060 b8 72 0a 75 25 77 09 79 b4 7a 26 7c 5d 7d 5a 7e >.r.u%w.y.z&|]}Z~<
000070 1c 7f a3 7f ee 7f fd 7f d0 7f 67 7f c3 7e e3 7d >..........g..~.}<
000080 c9 7c 74 7b e6 79 1e 78 1f 76 e8 73 7b 71 d9 6e >.|t{.y.x.v.s{q.n<
000090 03 6c fa 68 c1 65 57 62 c0 5e fd 5a 0f 57 f8 52 >.l.h.eWb.^.Z.W.R<
Notice the WAV file begins with the characters RIFF, which is the mandatory indicator that the file uses the WAV (RIFF) container ... if your system (I'm on Linux) does not have the above command-line utility xxd, then use any hex editor like wxHexEditor to similarly examine your file and confirm you see the RIFF ... if there is no RIFF, then it's simply not a WAV file.
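The same check can be done from Python by reading the first four bytes; here is a quick sketch (the filename is a placeholder):
with open("myaudiofile.wav", "rb") as f:
    magic = f.read(4)

if magic == b"RIFF":
    print("looks like a RIFF/WAV file")
elif magic == b"NIST":
    print("NIST SPHERE file ... convert it (e.g. with sph2pipe) before reading")
else:
    print("unknown format:", magic)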
Here are details of the WAV format specs:
http://soundfile.sapp.org/doc/WaveFormat/
http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html
http://unusedino.de/ec64/technical/formats/wav.html
http://www.drdobbs.com/database/inside-the-riff-specification/184409308
https://www.gamedev.net/articles/programming/general-and-gameplay-programming/loading-a-wave-file-r709
http://www.topherlee.com/software/pcm-tut-wavformat.html
http://www.labbookpages.co.uk/audio/javaWavFiles.html
http://www.johnloomis.org/cpe102/asgn/asgn1/riff.html
http://nagasm.org/ASL/sound05/

If you want generic code that works for every WAV file inside the folder, run:
forfiles /s /m *.wav /c "cmd /c sph2pipe -f wav #file #fnameRIFF.wav"
It searches for every WAV file it can find and creates a file that both scipy and wave can read, named <base_name>RIFF.wav.
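If you are not on Windows (forfiles is a Windows command), a rough cross-platform sketch of the same batch conversion in Python, assuming sph2pipe is on your PATH and you run it from the folder containing the files, could look like this:
import os
import subprocess

for root, _, files in os.walk("."):
    for name in files:
        # Skip files we already converted on a previous run.
        if name.lower().endswith(".wav") and not name.endswith("RIFF.wav"):
            src = os.path.join(root, name)
            dst = os.path.join(root, os.path.splitext(name)[0] + "RIFF.wav")
            subprocess.run(["sph2pipe", "-f", "wav", src, dst], check=True)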

I have written a Python script which will convert all the .WAV files in NIST format, spoken by all speakers from all dialects, to .wav files which can be played on your system.
Note: All the dialect folders are present in ./TIMIT/TRAIN/ . You may have to change dialects_path according to your project structure (or if you are on Windows).
import os
import glob
from sphfile import SPHFile

dialects_path = "./TIMIT/TRAIN/"
dialects = os.listdir(path=dialects_path)

for dialect in dialects:
    dialect_path = dialects_path + dialect
    speakers = os.listdir(path=dialect_path)
    for speaker in speakers:
        speaker_path = os.path.join(dialect_path, speaker)
        speaker_recordings = os.listdir(path=speaker_path)
        wav_files = glob.glob(speaker_path + '/*.WAV')
        for wav_file in wav_files:
            sph = SPHFile(wav_file)
            txt_file = wav_file[:-3] + "TXT"
            with open(txt_file, 'r') as f:
                for line in f:
                    words = line.split(" ")
                    start_time = int(words[0]) / 16000
                    end_time = int(words[1]) / 16000
            print("writing file ", wav_file)
            sph.write_wav(wav_file.replace(".WAV", ".wav"), start_time, end_time)

Use sounddevice and soundfile to obtain the data as a NumPy array (and to play it back) with the following code:
import matplotlib.pyplot as plt
import soundfile as sf
import sounddevice as sd
# https://catalog.ldc.upenn.edu/desc/addenda/LDC93S1.wav
data, fs = sf.read('LDC93S1.wav')
print(data.shape,fs)
sd.play(data, fs, blocking=True)
plt.plot(data)
plt.show()
Output
(46797,) 16000
A sample TIMIT database wav file: https://catalog.ldc.upenn.edu/desc/addenda/LDC93S1.wav

Sometimes this can be caused by extracting the 7zip archive incorrectly. I had a similar issue and sorted it out by extracting the dataset with 7z x <datasetname>.7z

Related

Unknown event in MIDI file

As I've posted about before, I am writing a MIDI parser in Python. I am encountering an error where my parser gets stuck because it tries to read an event called 2a, but no such event exists. Below is an excerpt from the MIDI file in question:
5d7f 00b5 5d7f 00b6 5d7f 00b1 5d00 00b9
5d00 8356 9923 7f00 2a44 0192 367f 0091
237f 0099 4640 0092 2f7c 0099 3f53 0b3f
I have parsed the file by hand, and I am getting stuck in the same spot as my parser! The MIDI file plays, so I know it's valid, but I'm certain that I am reading the events wrong.
The Standard MIDI Files 1.0 specification says:
Running status is used: status bytes of MIDI channel messages may be omitted if the preceding event is a MIDI channel message with the same status. The first event in each MTrk chunk must specify status. Delta-time is not considered an event itself: it is an integral part of the syntax for an MTrk event. Notice that running status occurs across delta-times.
Your excerpt would be decoded as follows:
delta <- event ------->
time status parameters
----- ------ ----------
... 5d 7f
00 b5 5d 7f
00 b6 5d 7f
00 b1 5d 00
00 b9 5d 00
83 56 99 23 7f
00 2a 44
01 92 36 7f
00 91 23 7f
00 99 46 40
00 92 2f 7c
00 99 3f 53
0b 3f ...
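To make the running-status rule concrete, here is a small Python sketch (my own illustration, not a full SMF parser) that walks the excerpt above starting from the first complete event (00 b5 5d 7f); it assumes only channel messages appear and leaves off the incomplete trailing 0b 3f:
# Data-byte counts per channel-message type (high nibble of the status byte).
DATA_BYTES = {0x8: 2, 0x9: 2, 0xA: 2, 0xB: 2, 0xC: 1, 0xD: 1, 0xE: 2}

def read_vlq(data, i):
    """Read a variable-length quantity (the delta time) starting at index i."""
    value = 0
    while True:
        byte = data[i]
        i += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, i

def parse_channel_events(data):
    """Yield (delta, status, params) tuples, handling running status."""
    i = 0
    status = None
    while i < len(data):
        delta, i = read_vlq(data, i)
        if data[i] & 0x80:          # a new status byte
            status = data[i]
            i += 1
        # otherwise: running status, reuse the previous status byte
        n = DATA_BYTES[status >> 4]
        params, i = data[i:i + n], i + n
        yield delta, status, params

excerpt = bytes.fromhex(
    "00b55d7f 00b65d7f 00b15d00 00b95d00 "
    "8356 99237f 002a44 0192367f 0091237f "
    "00994640 00922f7c 00993f53"
)
for delta, status, params in parse_channel_events(excerpt):
    print("delta=%02x status=%02x params=%s" % (delta, status, params.hex()))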

Difference in result while reading same file with node and python

I have been trying to read the contents of the genesis.block given in this file of the Node SDK in Hyperledger Fabric using Python. However, whenever I try to read the file with Python by using
data = open("twoorgs.genesis.block").read()
The value of the data variable is as follows:
>>> data
'\n'
With nodejs using fs.readFileSync() I obtain an instance of Buffer() for the same file.
var data = fs.readFileSync('./twoorgs.genesis.block');
The result is
> data
<Buffer 0a 22 1a 20 49 63 63 ac 9c 9f 3e 48 2c 2c 6b 48 2b 1f 8b 18 6f a9 db ac 45 07 29 ee c0 bf ac 34 99 9e c2 56 12 e1 84 01 0a dd 84 01 0a d9 84 01 0a 79 ... >
How can I read this file successfully using Python?
Your file has a 1a in it. This is Ctrl-Z, which is an end-of-file marker on Windows.
So try binary mode like:
data = open("twoorgs.genesis.block", 'rb').read()

Speed up python code

I have a text file in the following format (network traffic collected by tcpdump):
1505372009.023944 00:1e:4c:72:b8:ae > 00:23:f8:93:c1:af, ethertype IPv4 (0x0800), length 97: (tos 0x0, ttl 64, id 5134, offset 0, flags [DF], proto TCP (6), length 83)
192.168.1.53.36062 > 74.125.143.139.443: Flags [P.], cksum 0x67fd (correct), seq 1255996541:1255996572, ack 1577943820, win 384, options [nop,nop,TS val 356377 ecr 746170020], length 31
0x0000: 0023 f893 c1af 001e 4c72 b8ae 0800 4500 .#......Lr....E.
0x0010: 0053 140e 4000 4006 8ab1 c0a8 0135 4a7d .S..#.#......5J}
0x0020: 8f8b 8cde 01bb 4adc fc7d 5e0d 830c 8018 ......J..}^.....
0x0030: 0180 67fd 0000 0101 080a 0005 7019 2c79 ..g.........p.,y
0x0040: a6a4 1503 0300 1a00 0000 0000 0000 04d1 ................
0x0050: c300 9119 6946 698c 67ac 47a9 368a 1748 ....iFi.g.G.6..H
0x0060: 1c .
and want to change it to:
1505372009.023944
000000: 00 23 f8 93 c1 af 00 1e 4c 72 b8 ae 08 00 45 00 .#......Lr....E.
000010: 00 53 14 0e 40 00 40 06 8a b1 c0 a8 01 35 4a 7d .S..#.#......5J}
000020: 8f 8b 8c de 01 bb 4a dc fc 7d 5e 0d 83 0c 80 18 ......J..}^.....
000030: 01 80 67 fd 00 00 01 01 08 0a 00 05 70 19 2c 79 ..g.........p.,y
000040: a6 a4 15 03 03 00 1a 00 00 00 00 00 00 00 04 d1 ................
000050: c3 00 91 19 69 46 69 8c 67 ac 47 a9 36 8a 17 48 ....iFi.g.G.6..H
000060: 1c .
Here is what I have done:
import re
regexp_time =re.compile("\d\d\d\d\d\d\d\d\d\d.\d\d\d\d\d\d+")
regexp_hex = re.compile("(\t0x\d+:\s+)([0-9a-f ]+)+ ")
with open('../Traffic/traffic1.txt') as input, open('../Traffic/txt2.txt', 'w') as output:
    for line in input:
        if regexp_time.match(line):
            output.write("%s\n" % (line.split()[0]))
        elif regexp_hex.match(line):
            words = re.split(r'\s{2,}', line)
            bytes = ""
            for byte in words[1].split():
                if len(byte) == 4:
                    bytes += "%s%s %s%s " % (byte[0], byte[1], byte[2], byte[3])
                elif len(byte) == 2:
                    bytes += "%s%s " % (byte[0], byte[1])
            output.write("%s %s %s \n" % (words[0].replace("0x", "00"), "{:<47}".format(bytes), words[2].replace("\n", "")))
input.close()
output.close()
Could someone help me speed it up?
Edit
Here is the new version of the code, based on Austin's answer. It really speeds up the code.
with open('../Traffic/traffic1.txt') as input, open('../Traffic/txt1.txt', 'w') as output:
    for line in input:
        if line[0].isdigit():
            output.write(line[:16])
            output.write('\n')
        elif line.startswith("\t0x"):  # (since there are lines which are neither hex nor a timestamp, I check this as well)
            offset = line[:10]   # "\t0x0000:  "
            words = line[10:51]  # "0023 f893 c1af 001e 4c72 b8ae 0800 4500 "
            chars = line[51:]    # " .#......Lr....E."
            line = [offset.replace('x', '0', 1)]
            for a, b, c, d, space in zip(words[0::5], words[1::5], words[2::5], words[3::5], words[4::5]):
                line.append(a)
                line.append(b)
                line.append(space)
                line.append(c)
                line.append(d)
                line.append(space)
            line.append(chars)
            output.write(''.join(line))
input.close()
output.close()
Here is the result:
1505372009.02394
000000: 00 23 f8 93 c1 af 00 1e 4c 72 b8 ae 08 00 45 00 .#......Lr....E.
000010: 00 53 14 0e 40 00 40 06 8a b1 c0 a8 01 35 4a 7d .S..#.#......5J}
000020: 8f 8b 8c de 01 bb 4a dc fc 7d 5e 0d 83 0c 80 18 ......J..}^.....
000030: 01 80 67 fd 00 00 01 01 08 0a 00 05 70 19 2c 79 ..g.........p.,y
000040: a6 a4 15 03 03 00 1a 00 00 00 00 00 00 00 04 d1 ................
000050: c3 00 91 19 69 46 69 8c 67 ac 47 a9 36 8a 17 48 ....iFi.g.G.6..H
000060: 1c .
You haven't specified anything else about your file format, including what if any lines appear between blocks of packet data. So I'm going to assume that you just have paragraphs like the one you show, jammed together.
The best way to speed up something like this is to reduce the extra operations. You have a bunch! For example:
You use a regex to match the "start" line.
You use a split to extract the timestamp from the start line.
You use a %-format operator to write the timestamp out.
You use a different regex to match a "hex" line.
You use more than one split to parse the hex line.
You use various formatting operators to output the hex line.
If you're going to use regular expression matching, then I think you should just do one match. Create an alternate pattern (like a|b) that describes both lines. Use match.lastgroup or .lastindex to decide what got matched.
But your lines are so different that I don't think a regex is needed. Basically, you can decide what sort of line you have by looking at the very first character:
if line[0].isdigit():
    # This is a timestamp line
else:
    # This is a hex line
For timestamp processing, all you want to do is print out the 17 characters at the start of the line: 10 digits, a dot, and 6 more digits. So do that:
if line[0].isdigit():
    output.write(line[:17] + '\n')
For hex line processing, you want to make two kinds of changes: you want to replace the 'x' in the hex offset with a zero. That's easy:
hexline = line.replace('x', '0', 1) # Note: 1 replacement only!
Then, you want to insert spaces between the groups of 4 hex digits, and pad the short lines so the character display appears in the same column.
This is a place where regular expression replacement might help you. There's a limited number of occurrences, but it may be that the overhead of the CPython interpreter costs more than the setup and teardown for a regex replacement. You probably should do some profiling on this.
That said, you can split the line into three parts. It's important to capture the trailing space on the middle part, though:
offset = line[:13] # " 0x0000: "
words = line[13:53] # "0023 f893 c1af 001e 4c72 b8ae 0800 4500 "
chars = line[53:] # " .#......Lr....E."
You already know how to replace the 'x' in the offset, and there's nothing to be done to the chars portion of the line. So we'll leave those alone. The remaining task is to spread out the characters in the
words string. You can do that in various ways, but it seems easy to process the characters in chunks of 5 (4 hex digits plus a trailing space).
We can do this because we captured the trailing space on the words part. If not, you might have to use itertools.zip_longest(..., fillvalue=''), but it's probably easier just to grab one more character.
With that done, you can do:
for a, b, c, d, space in zip(words[0::5], words[1::5], words[2::5], words[3::5], words[4::5]):
    output.write(a + b + space + c + d + space)
Alternatively, instead of making all those calls you could accumulate the characters in a buffer and then write the buffer one time. Something like:
line = [offset]
for ...:
    line.extend((a, b, space, c, d, space))
line.append(chars)
line.append('\n')
output.write(''.join(line))
That's fairly straightforward, but like I said, it may not perform quite as well as a regular-expression replacement. That would be due to the regex code running as "C" rather than Python bytecode. So you should compare it against a pattern replacement like:
words = re.sub(r'(..)(..) ', r'\1 \2 ', words)
Note that I didn't require hex digits, in order to cause any trailing "padding" spaces on the last line of a paragraph to expand in proportion.
Again, please check the performance against the zip version above!
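As a tiny illustration (my own addition, not part of the original answer), here is what that substitution does to one chunk of hex words:
import re

words = "0023 f893 c1af 001e "
print(re.sub(r'(..)(..) ', r'\1 \2 ', words))
# prints "00 23 f8 93 c1 af 00 1e " (each 4-digit group split, trailing space kept)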

Write null bytes in a file instead of correct strings

I have a Python script that processes a data file:
out = open('result/process/'+name+'.res','w')
out.write("source,rssi,lqi,packetId,run,counter\n")
f = open('result/resultat0.res','r')
for ligne in [x for x in f if x != '']:
    chaine = ligne.rstrip('\n')
    tmp = chaine.split(',')
    if len(tmp) == 6:
        out.write(','.join(tmp) + "\n")
f.close()
The complete code is here
I use this script on several computers and the behavior is not the same.
On the first computer, with Python 2.6.6, the result is what I expect.
However, on the others (Python 2.6.6, 3.3.2, 2.7.5) the write method of the file object writes null bytes instead of the values I want for most of the processing. I get this result:
$ hexdump -C result/process/1.res
00000000 73 6f 75 72 63 65 2c 72 73 73 69 2c 6c 71 69 2c |source,rssi,lqi,|
00000010 70 61 63 6b 65 74 49 64 2c 72 75 6e 2c 63 6f 75 |packetId,run,cou|
00000020 6e 74 65 72 0a 00 00 00 00 00 00 00 00 00 00 00 |nter............|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
0003a130 00 00 00 00 00 00 00 00 00 00 31 33 2c 36 35 2c |..........13,65,|
0003a140 31 34 2c 38 2c 39 38 2c 31 33 31 34 32 0a 31 32 |14,8,98,13142.12|
0003a150 2c 34 37 2c 31 37 2c 38 2c 39 38 2c 31 33 31 34 |,47,17,8,98,1314|
0003a160 33 0a 33 2c 34 35 2c 31 38 2c 38 2c 39 38 2c 31 |3.3,45,18,8,98,1|
0003a170 33 31 34 34 0a 31 31 2c 38 2c 32 33 2c 38 2c 39 |3144.11,8,23,8,9|
0003a180 38 2c 31 33 31 34 35 0a 39 2c 32 30 2c 32 32 2c |8,13145.9,20,22,|
Do you have any idea how to resolve this problem, please?
With the following considerations:
In over a decade of programming Python, I've never come across a compelling reason to use global. Pass arguments to functions instead.
For ensuring files are closed when finished with, use the with statement.
Here's an (untested) attempt at refactoring your code for sanity, which assumes that you have enough memory available to hold all of the lines under a particular identifier.
If you have null bytes in your result files after this refactoring, then we have a reasonable basis to proceed with debugging.
import os
import re
from contextlib import closing

def list_files_to_process(directory='results'):
    """
    Return a list of files from directory where the file extension is '.res',
    case insensitive.
    """
    results = []
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        if os.path.isfile(filepath) and filename.lower().endswith('.res'):
            results.append(filepath)
    return results

def group_lines(sequence):
    """
    Generator, process a sequence of lines, separated by a particular line.
    Yields batches of lines along with the id from the separator.
    """
    separator = re.compile('^A:(?P<id>\d+):$')
    batch = []
    batch_id = None
    for line in sequence:
        if not line:  # Ignore blanks
            continue
        m = separator.match(line)
        if m is not None:
            if batch_id is not None or len(batch) > 0:
                yield (batch_id, batch)
            batch_id = m.group('id')
            batch = []
        else:
            batch.append(line)
    if batch_id is not None or len(batch) > 0:
        yield (batch_id, batch)

def filename_for_results(batch_id, result_directory):
    """
    Return an appropriate filename for a batch_id under the result directory
    """
    return os.path.join(result_directory, "results-%s.res" % (batch_id,))

def open_result_file(filename, header="source,rssi,lqi,packetId,run,counter"):
    """
    Return an open file object in append mode, having appended a header if
    filename doesn't exist or is empty
    """
    if os.path.exists(filename) and os.path.getsize(filename) > 0:
        # No need to write header
        return open(filename, 'a')
    else:
        f = open(filename, 'a')
        f.write(header + '\n')
        return f

def process_file(filename, result_directory='results/processed'):
    """
    Open filename and process its contents. Uses group_lines() to group
    lines into different files based upon a specific line acting as a
    content separator.
    """
    error_filename = filename_for_results('error', result_directory)
    with open(filename, 'r') as in_file, open(error_filename, 'w') as error_out:
        for batch_id, lines in group_lines(in_file):
            if len(lines) == 0:
                error_out.write("Received batch %r with 0 lines" % (batch_id,))
                continue
            out_filename = filename_for_results(batch_id, result_directory)
            with closing(open_result_file(out_filename)) as out_file:
                for line in lines:
                    if line.startswith('L') and line.endswith('E') and line.count(',') == 5:
                        line = line.lstrip('L').rstrip('E')
                        out_file.write(line + '\n')
                    else:
                        error_out.write("Unknown line, batch=%r: %r\n" % (batch_id, line))

if __name__ == '__main__':
    files = list_files_to_process()
    for filename in files:
        print("Processing %s" % (filename,))
        process_file(filename)

Parse WAV file header

I am writing a program to parse a WAV file header and print the information to the screen. Before writing the program, I did some research:
hexdump -n 48 sound_file_8000hz.wav
00000000 52 49 46 46 bc af 01 00 57 41 56 45 66 6d 74 20 |RIFF....WAVEfmt |
00000010 10 00 00 00 01 00 01 00 >40 1f 00 00< 40 1f 00 00 |........#...#...|
00000020 01 00 08 00 64 61 74 61 98 af 01 00 81 80 81 80 |....data........|
hexdump -n 48 sound_file_44100hz.wav
00000000 52 49 46 46 c4 ea 1a 00 57 41 56 45 66 6d 74 20 |RIFF....WAVEfmt |
00000010 10 00 00 00 01 00 02 00 >44 ac 00 00< 10 b1 02 00 |........D.......|
00000020 04 00 10 00 64 61 74 61 a0 ea 1a 00 00 00 00 00 |....data........|
The part between > and < in both files is the sample rate.
How does "40 1f 00 00" translate to 8000Hz and "44 ac 00 00" to 44100Hz? Information like number of channels and audio format can be read directly from the dump. I found a Python
script called WavHeader that parses the sample rate correctly in both files. This is the core of the script:
bufHeader = fileIn.read(38)
# Verify that the correct identifiers are present
if (bufHeader[0:4] != "RIFF") or \
   (bufHeader[12:16] != "fmt "):
    logging.debug("Input file not a standard WAV file")
    return
# endif
stHeaderFields = {'ChunkSize' : 0, 'Format' : '',
                  'Subchunk1Size' : 0, 'AudioFormat' : 0,
                  'NumChannels' : 0, 'SampleRate' : 0,
                  'ByteRate' : 0, 'BlockAlign' : 0,
                  'BitsPerSample' : 0, 'Filename': ''}
# Parse fields
stHeaderFields['ChunkSize'] = struct.unpack('<L', bufHeader[4:8])[0]
stHeaderFields['Format'] = bufHeader[8:12]
stHeaderFields['Subchunk1Size'] = struct.unpack('<L', bufHeader[16:20])[0]
stHeaderFields['AudioFormat'] = struct.unpack('<H', bufHeader[20:22])[0]
stHeaderFields['NumChannels'] = struct.unpack('<H', bufHeader[22:24])[0]
stHeaderFields['SampleRate'] = struct.unpack('<L', bufHeader[24:28])[0]
stHeaderFields['ByteRate'] = struct.unpack('<L', bufHeader[28:32])[0]
stHeaderFields['BlockAlign'] = struct.unpack('<H', bufHeader[32:34])[0]
stHeaderFields['BitsPerSample'] = struct.unpack('<H', bufHeader[34:36])[0]
I do not understand how this can extract the correct sample rates, when I cannot do so using hexdump?
I am using information about the WAV file format from this page:
https://ccrma.stanford.edu/courses/422/projects/WaveFormat/
The "40 1F 00 00" bytes equate to an integer whose hexadecimal value is 00001F40 (remember that the integers are stored in a WAVE file in the little endian format). A value of 00001F40 in hexadecimal equates to a decimal value of 8000.
Similarly, the "44 AC 00 00" bytes equate to an integer whose hexadecimal value is 0000AC44. A value of 0000AC44 in hexadecimal equates to a decimal value of 44100.
They're little-endian.
>>> 0x00001f40
8000
>>> 0x0000ac44
44100
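For completeness, here is a tiny struct-based sketch (mirroring what the WavHeader script does) that confirms the interpretation of those byte sequences:
import struct

# '<L' unpacks 4 bytes as an unsigned little-endian integer.
print(struct.unpack('<L', bytes.fromhex('401f0000'))[0])  # 8000
print(struct.unpack('<L', bytes.fromhex('44ac0000'))[0])  # 44100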
