Difference in result while reading same file with node and python

Difference in result while reading same file with node and python - python

I have been trying to read the contents of the genesis.block given in this file of the Node SDK in Hyperledger Fabric using Python. However, whenever I try to read the file with Python by using
data = open("twoorgs.genesis.block").read()
The value of the data variable is as follows:
>>> data
'\n'
With nodejs using fs.readFileSync() I obtain an instance of Buffer() for the same file.
var data = fs.readFileSync('./twoorgs.genesis.block');
The result is
> data
<Buffer 0a 22 1a 20 49 63 63 ac 9c 9f 3e 48 2c 2c 6b 48 2b 1f 8b 18 6f a9 db ac 45 07 29 ee c0 bf ac 34 99 9e c2 56 12 e1 84 01 0a dd 84 01 0a d9 84 01 0a 79 ... >
How can I read this file successfully using Python?

You file has a 1a in it. This is Ctrl-Z, which is an end of file on Windows.
So try binary mode like:
data = open("twoorgs.genesis.block", 'rb').read()

Related

Unknown event in MIDI file

As I've posted about before, I am writing a MIDI parser in Python. I am encountering an error where my parser is getting stuck because it's trying to read an event called 2a, but such an event does not exist. below is an excerpt from the MIDI file in question:
5d7f 00b5 5d7f 00b6 5d7f 00b1 5d00 00b9
5d00 8356 9923 7f00 2a44 0192 367f 0091
237f 0099 4640 0092 2f7c 0099 3f53 0b3f
I have parsed the file by hand, and I am getting stuck in the same spot as my parser! The MIDI file plays, so I know it's valid, but I'm certain that I am reading the events wrong.

The Standard MIDI Files 1.0 specification says:
Running status is used: status bytes of MIDI channel messages may be omitted if the preceding event is a MIDI channel message with the same status. The first event in each MTrk chunk must specify status. Delta-time is not considered an event itself: it is an integral part of the syntax for an MTrk event. Notice that running status occurs across delta-times.
Your excerpt would be decoded as follows:
delta <- event ------->
time status parameters
----- ------ ----------
... 5d 7f
00 b5 5d 7f
00 b6 5d 7f
00 b1 5d 00
00 b9 5d 00
83 56 99 23 7f
00 2a 44
01 92 36 7f
00 91 23 7f
00 99 46 40
00 92 2f 7c
00 99 3f 53
0b 3f ...

reading a WAV file from TIMIT database in python

I'm trying to read a wav file from the TIMIT database in python but I get an error:
When I'm using wave:
wave.Error: file does not start with RIFF id
When I'm using scipy:
ValueError: File format b'NIST'... not understood.
and when I'm using librosa, the program got stuck.
I tried to convert it to wav using sox:
cmd = "sox " + wav_file + " -t wav " + new_wav
subprocess.call(cmd, shell=True)
and it didn't help. I saw an old answer referencing to the package scikits.audiolab but it looks like it is no longer supported.
How can I read these file to get a ndarray of the data?
Thanks

Your file is not a WAV file. Apparently it is a NIST SPHERE file. From the LDC web page: "Many LDC corpora contain speech files in NIST SPHERE format." According to the description of the NIST File Format, the first four characters of the file are NIST. That's what the scipy error is telling you: it doesn't know how to read a file that begins with NIST.
I suspect you'll have to convert the file to WAV if you want to read the file with any of the libraries that you tried. To force the conversion to WAV using the program sph2pipe, use the command option -f wav (or equivalently, -f rif), e.g.
sph2pipe -f wav input.sph output.wav

issue this from command line to verify its a wav file ... or not
xxd -b myaudiofile.wav | head
if its wav format it will appear something like
00000000: 01010010 01001001 01000110 01000110 10111100 10101111 RIFF..
00000006: 00000001 00000000 01010111 01000001 01010110 01000101 ..WAVE
0000000c: 01100110 01101101 01110100 00100000 00010000 00000000 fmt ..
00000012: 00000000 00000000 00000001 00000000 00000001 00000000 ......
00000018: 01000000 00011111 00000000 00000000 01000000 00011111 #...#.
0000001e: 00000000 00000000 00000001 00000000 00001000 00000000 ......
00000024: 01100100 01100001 01110100 01100001 10011000 10101111 data..
0000002a: 00000001 00000000 10000001 10000000 10000001 10000000 ......
00000030: 10000001 10000000 10000001 10000000 10000001 10000000 ......
00000036: 10000001 10000000 10000001 10000000 10000001 10000000 ......
here is yet another way to display contents of a binary file like a WAV
od -A x -t x1z -v audio_util_test_file_custom.wav | head
000000 52 49 46 46 24 80 00 00 57 41 56 45 66 6d 74 20 >RIFF$...WAVEfmt <
000010 10 00 00 00 01 00 01 00 44 ac 00 00 88 58 01 00 >........D....X..<
000020 02 00 10 00 64 61 74 61 00 80 00 00 00 00 78 05 >....data......x.<
000030 ed 0a 5e 10 c6 15 25 1b 77 20 ba 25 eb 2a 08 30 >..^...%.w .%.*.0<
000040 0e 35 fc 39 cf 3e 84 43 1a 48 8e 4c de 50 08 55 >.5.9.>.C.H.L.P.U<
000050 0b 59 e4 5c 91 60 12 64 63 67 85 6a 74 6d 30 70 >.Y.\.`.dcg.jtm0p<
000060 b8 72 0a 75 25 77 09 79 b4 7a 26 7c 5d 7d 5a 7e >.r.u%w.y.z&|]}Z~<
000070 1c 7f a3 7f ee 7f fd 7f d0 7f 67 7f c3 7e e3 7d >..........g..~.}<
000080 c9 7c 74 7b e6 79 1e 78 1f 76 e8 73 7b 71 d9 6e >.|t{.y.x.v.s{q.n<
000090 03 6c fa 68 c1 65 57 62 c0 5e fd 5a 0f 57 f8 52 >.l.h.eWb.^.Z.W.R<
notice the wav file begins with the characters RIFF
which is the mandatory indicator the file is using wav codec ... if your system (I'm on linux) does not have above command line utility : xxd then use any hex editor like wxHexEditor to similarily examine your wav file to confirm you see the RIFF ... if no RIFF then its simply not a wav file
Here are details of wav format specs
http://soundfile.sapp.org/doc/WaveFormat/
http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html
http://unusedino.de/ec64/technical/formats/wav.html
http://www.drdobbs.com/database/inside-the-riff-specification/184409308
https://www.gamedev.net/articles/programming/general-and-gameplay-programming/loading-a-wave-file-r709
http://www.topherlee.com/software/pcm-tut-wavformat.html
http://www.labbookpages.co.uk/audio/javaWavFiles.html
http://www.johnloomis.org/cpe102/asgn/asgn1/riff.html
http://nagasm.org/ASL/sound05/

If you want a generic code that works for every wav file inside the folder run:
forfiles /s /m *.wav /c "cmd /c sph2pipe -f wav #file #fnameRIFF.wav"
It search for every wav file that can find and create a wav file that both scipy and wave can read with the name < base_name >RIFF.wav

I have written a python script which will convert all the .WAV files in NIST format spoken by all speakers from all dialects to .wav files which ca
n be played on your system.
Note: All the dialects folders are present in ./TIMIT/TRAIN/ . You may have to change the dialects_path according to your project structure(or if you are on Windows)
from sphfile import SPHFile
dialects_path = "./TIMIT/TRAIN/"
for dialect in dialects:
dialect_path = dialects_path + dialect
speakers = os.listdir(path = dialect_path)
for speaker in speakers:
speaker_path = os.path.join(dialect_path,speaker)
speaker_recordings = os.listdir(path = speaker_path)
wav_files = glob.glob(speaker_path + '/*.WAV')
for wav_file in wav_files:
sph = SPHFile(wav_file)
txt_file = ""
txt_file = wav_file[:-3] + "TXT"
f = open(txt_file,'r')
for line in f:
words = line.split(" ")
start_time = (int(words[0])/16000)
end_time = (int(words[1])/16000)
print("writing file ", wav_file)
sph.write_wav(wav_file.replace(".WAV",".wav"),start_time,end_time)

Please use sounddevice and soundfile to obtain the
numpy array data (and playback) using the following code:
import matplotlib.pyplot as plt
import soundfile as sf
import sounddevice as sd
# https://catalog.ldc.upenn.edu/desc/addenda/LDC93S1.wav
data, fs = sf.read('LDC93S1.wav')
print(data.shape,fs)
sd.play(data, fs, blocking=True)
plt.plot(data)
plt.show()
Output
(46797,) 16000
A sample TIMIT database wav file: https://catalog.ldc.upenn.edu/desc/addenda/LDC93S1.wav

Sometimes this can be caused by the incorrect method of extracting a 7zip file. I had a similar issue. I sorted out this issue by extracting the dataset using 7z x <datasetname>.7z

Python: convert hex bytestream to “int16"

So I'm working with incoming audio from Watson Text to Speech. I want to play the sound immediately when data arrives to Python with a websocket from nodeJS.
This is a example of data I'm sending with the websocket:
<Buffer e3 f8 28 f9 fa f9 5d fb 6c fc a6 fd 12 ff b3 00 b8 02 93 04 42 06 5b 07 e4 07 af 08 18 0a 95 0b 01 0d a2 0e a4 10 d7 12 f4 12 84 12 39 13 b0 12 3b 13 ... >
So the data arrives as a hex bytestream and I try to convert it to something that Sounddevice can read/play. (See documentation: The types 'float32', 'int32', 'int16', 'int8' and 'uint8' can be used for all streams and functions.) But how can I convert this?
I already tried something, but when I run my code I only hear some noise, nothing recognizable.
Here you can read some parts of my code:
def onMessage(self, payload, isBinary):
a = payload.encode('hex')
queue.put(a)
After I receive the bytesstream and convert to hex, I try to send the incoming bytestream to Sounddevice:
def stream_audio():
with sd.OutputStream(channels=1, samplerate=24000, dtype='int16', callback=callback):
sd.sleep(int(20 * 1000))
def callback(outdata, frames, time, status):
global reststuff, i, string
LENGTH = frames
while len(reststuff) < LENGTH:
a = queue.get()
reststuff += a
returnstring = reststuff[:LENGTH]
reststuff = reststuff[LENGTH:]
for char in returnstring:
i += 1
string += char
if i % 2 == 0:
print string
outdata[:] = int(string, 16)
string = ""

look at your stream of data:
e3 f8 28 f9 fa f9 5d fb 6c fc a6 fd 12 ff b3 00
b8 02 93 04 42 06 5b 07 e4 07 af 08 18 0a 95 0b
01 0d a2 0e a4 10 d7 12 f4 12 84 12 39 13 b0 12
3b 13
you see here that every two bytes the second one is starting with e/f/0/1 which means near zero (in two's complement).
So that's your most significant bytes, so your stream is little-endian!
you should consider that in your conversion.
If I have more data I would have tested but this is worth some miliseconds!

Identify the contents a file through a program in python [duplicate]

This question already has answers here:
Tools to help reverse engineer binary file formats
(9 answers)
Closed 6 years ago.
I have a file here. To me it appears it is a binary file. This is raw file and I believe that it has the stock information in OHLCV (Open, High, Low, Close, Volume). Besides it may also have some text.
One of the entries that I could possibly have for OHLCV is
464.95, 468.3, 460, 465.65, 3957854
This is the code that I have tried. I dont fully understand about ASCII and Unicode.
input_file = "00063181.dat" # tata motors
with open(input_file, "rb") as fh:
buf = fh.read()
output_l = list(map(int , buf))
print (output_l)
My Doubt: How do I decode this file and make sense out of it? Is there any way for me to read this file through a program written in python and separate the text from int/float? I am using Python 3 and Win 10 64 bit.

You're looking to reverse engineer the structure of a binary file using Python. Since you've declared that the file is binary, it may prove difficult. You're going to need to examine the contents of the file and use your best intuition to try to infer the structure. The first thing you're going to want is a way to display each of the bytes of the file an a way that will help you understand the meaning.
Fortunately, someone has already written a tool to do this, hexdump. Install that package using pip.
The function you need from that package is hexdump, so let's import it the package and get help on the function.
>>> import hexdump
>>> help(hexdump.hexdump)
Help on function hexdump in module hexdump:
hexdump(data, result='print')
Transform binary data to the hex dump text format:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[x] data argument as a binary string
[x] data argument as a file like object
Returns result depending on the `result` argument:
'print' - prints line by line
'return' - returns single string
'generator' - returns generator that produces lines
Now you can start to explore the contents of your file. Use the slice operator to do it in chunks. For example, to render the contents of the first 1KB of your file:
>>> hexdump.hexdump(buf[:1024])
00000000: C3 8E C2 8F 22 13 C2 AA 66 2A 22 47 C3 94 C3 AA ...."...f*"G....
00000010: C3 89 C3 A0 C3 B1 C3 91 6A C2 A4 C3 BF 3C C2 AA ........j....<..
00000020: C2 91 73 C3 85 46 57 47 C2 88 C3 99 C2 B6 3E 2D ..s..FWG......>-
00000030: C3 BA 69 10 C2 93 C3 94 38 C3 81 7A 6A 43 30 7C ..i.....8..zjC0|
00000040: C3 BB C2 AA 01 2D C2 97 C3 83 C3 88 64 14 C3 9C .....-......d...
00000050: C2 AB C2 AA C3 A2 74 C2 85 5D C3 97 4E 64 68 C3 ......t..]..Ndh.
...
000003C0: 42 C2 8F 06 7F 12 33 7F 79 1E 2C 2A 0F C3 92 36 B.....3.y.,*...6
000003D0: C3 A6 C2 96 C2 93 C2 8B 43 C2 9F 4C C2 95 48 24 ........C..L..H$
000003E0: C2 B3 C2 82 26 C3 88 C3 BD C3 96 12 1E 5E 18 2E ....&........^..
000003F0: 37 C3 A7 C2 87 C3 AE 00 4F 3F C2 9C C3 A8 1C C2 7.......O?......
Hexdump has a nice property of rendering the byte position, the hex code, and then (if possible) the printable form of the character on the right.
Hopefully some of your text values will be visible there and that will give some clue as to how to reverse engineer your file.
Once you've started to determine how your file is structured, you can use the various string operators to manipulate your data. For example, if you find that your file is split into sections by the null byte (b'\x00'), you can get those sections thus:
>>> sections = buf.split(b'\x00')
There are a lot of things that you're likely to have to learn as you dig deeper, like character encodings, number encodings (including little-endian for integers and floating-point encoding for floating point numbers). You'll want to find some way to externally validate your results.
Best of luck.

Write null bytes in a file instead of correct strings

I have a python script that process a data file :
out = open('result/process/'+name+'.res','w')
out.write("source,rssi,lqi,packetId,run,counter\n")
f = open('result/resultat0.res','r')
for ligne in [x for x in f if x != '']:
chaine = ligne.rstrip('\n')
tmp = chaine.split(',')
if (len(tmp) == 6 ):
out.write(','.join(tmp)+"\n")
f.close()
The complete code is here
I use this script on several computers and the behavior is not the same.
On the first computer, with python 2.6.6, the result is what I expect.
However, on the others (python 2.6.6, 3.3.2, 2.7.5) the write method of file object puts null bytes instead of the values I want during the most part of the processing. I get this result :
$ hexdump -C result/process/1.res
00000000 73 6f 75 72 63 65 2c 72 73 73 69 2c 6c 71 69 2c |source,rssi,lqi,|
00000010 70 61 63 6b 65 74 49 64 2c 72 75 6e 2c 63 6f 75 |packetId,run,cou|
00000020 6e 74 65 72 0a 00 00 00 00 00 00 00 00 00 00 00 |nter............|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
0003a130 00 00 00 00 00 00 00 00 00 00 31 33 2c 36 35 2c |..........13,65,|
0003a140 31 34 2c 38 2c 39 38 2c 31 33 31 34 32 0a 31 32 |14,8,98,13142.12|
0003a150 2c 34 37 2c 31 37 2c 38 2c 39 38 2c 31 33 31 34 |,47,17,8,98,1314|
0003a160 33 0a 33 2c 34 35 2c 31 38 2c 38 2c 39 38 2c 31 |3.3,45,18,8,98,1|
0003a170 33 31 34 34 0a 31 31 2c 38 2c 32 33 2c 38 2c 39 |3144.11,8,23,8,9|
0003a180 38 2c 31 33 31 34 35 0a 39 2c 32 30 2c 32 32 2c |8,13145.9,20,22,|
Have you an idea how to resolve this problem please ?

With the following considerations:
In over a decade of programming python, I've never come across a compelling reason to use global. Pass arguments to functions instead.
For ensuring files are closed when finished with, use the with statement.
Here's an (untested) attempt at refactoring your code for sanity, assumes that you have enough memory available to hold all of the lines under a particular identifier.
If you have null bytes in your result files after this refactoring then we have reasonable basis to proceed with debugging.
import os
import re
from contextlib import closing
def list_files_to_process(directory='results'):
"""
Return a list of files from directory where the file extension is '.res',
case insensitive.
"""
results = []
for filename in os.listdir(directory):
filepath = os.path.join(directory,filename)
if os.path.isfile(filepath) and filename.lower().endswith('.res'):
results.append(filepath)
return results
def group_lines(sequence):
"""
Generator, process a sequence of lines, separated by a particular line.
Yields batches of lines along with the id from the separator.
"""
separator = re.compile('^A:(?P<id>\d+):$')
batch = []
batch_id = None
for line in sequence:
if not line: # Ignore blanks
continue
m = separator.match(line):
if m is not None:
if batch_id is not None or len(batch) > 0:
yield (batch_id,batch)
batch_id = m.group('id')
batch = []
else:
batch.append(line)
if batch_id is not None or len(batch) > 0:
yield (batch_id,batch)
def filename_for_results(batch_id,result_directory):
"""
Return an appropriate filename for a batch_id under the result directory
"""
return os.path.join(result_directory,"results-%s.res" % (batch_id,))
def open_result_file(filename,header="source,rssi,lqi,packetId,run,counter"):
"""
Return an open file object in append mode, having appended a header if
filename doesn't exist or is empty
"""
if os.path.exists(filename) and os.path.getsize(filename) > 0:
# No need to write header
return open(filename,'a')
else:
f = open(filename,'a')
f.write(header + '\n')
return f
def process_file(filename,result_directory='results/processed'):
"""
Open filename and process it's contents. Uses group_lines() to group
lines into different files based upon specific line acting as a
content separator.
"""
error_filename = filename_for_results('error',result_directory)
with open(filename,'r') as in_file, open(error_filename,'w') as error_out:
for batch_id, lines in group_lines(in_file):
if len(lines) == 0:
error_out.write("Received batch %r with 0 lines" % (batch_id,))
continue
out_filename = filename_for_results(batch_id,result_directory)
with closing(open_result_file(out_filename)) as out_file:
for line in lines:
if line.startswith('L') and line.endswith('E') and line.count(',') == 5:
line = line.lstrip('L').rstrip('E')
out_file.write(line + '\n')
else:
error_out.write("Unknown line, batch=%r: %r\n" %(batch_id,line))
if __name__ == '__main__':
files = list_files_to_process()
for filename in files:
print "Processing %s" % (filename,)
process_file(filename)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Difference in result while reading same file with node and python - python

You file has a 1a in it. This is Ctrl-Z, which is an end of file on Windows. So try binary mode like: data = open("twoorgs.genesis.block", 'rb').read()

Related

Unknown event in MIDI file

reading a WAV file from TIMIT database in python

Python: convert hex bytestream to “int16"

Identify the contents a file through a program in python [duplicate]

Write null bytes in a file instead of correct strings

Categories

Resources